The work presented in this thesis describes a method to extract the 3D pose of a human performing a gaited action using only the motion of a sparse set of tracked features. The features are not extracted using specific limb detectors; instead, they are selected and tracked automatically using a standard feature tracker. These features contain noise that is not Gaussian but systematic, arising from edge effects. Furthermore, features are tracked indiscriminately on both the subject of interest and the background. Consequently, a method designed to exploit only the spatial structure of the features would fail; the motion of the features must be exploited to extract useful information. A method is presented that models the expected motion of a feature tracking a particular limb. Using these models, the likelihood that each observed motion is caused by a specific limb can be estimated. This representation also allows the temporal state of an action to be estimated for each feature independently. By integrating over all features, a Hidden Markov Model is shown to extract gait phase accurately. Given these initial likelihood estimates, an approach is presented to create probability maps that describe the likelihood of a limb being located at each position in the image. From these maps, 2D pose in the image plane is estimated using a Pictorial Structures representation with phase-dependent priors. This search is performed via Dynamic Programming, independently in each frame. Temporal coherence is then enforced on the extracted poses using a high-level motion model. Finally, these methods are shown to be suitable for 3D pose estimation. Methods are presented to extract 3D motion from 2D image observations and to extract pose in R^3 efficiently. This is achieved by placing constraints on the search space and by mapping observations from the image plane into R^3.
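The abstract does not specify how the Hidden Markov Model integrates the per-feature estimates, so the following is only a minimal sketch under stated assumptions. It assumes gait phase is discretised into `K` cyclic states, that each phase either persists or advances by one state per frame, and that features are conditionally independent given the phase, so their log-likelihoods can be summed. The most likely phase sequence is then recovered with standard Viterbi decoding. The function names, the `p_advance` parameter and the `(T, F, K)` log-likelihood layout are hypothetical illustrations, not the thesis's own implementation.

```python
import numpy as np


def cyclic_transition_matrix(n_phases: int, p_advance: float = 0.7) -> np.ndarray:
    """Hypothetical cyclic left-to-right HMM: each phase persists or advances by one, wrapping around."""
    A = np.zeros((n_phases, n_phases))
    for i in range(n_phases):
        A[i, i] = 1.0 - p_advance
        A[i, (i + 1) % n_phases] = p_advance
    return A


def viterbi_gait_phase(feature_loglik: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Return the most likely phase index per frame.

    feature_loglik has shape (T, F, K): log p(motion of feature f at frame t | phase k).
    Assumes features are conditionally independent given phase.
    """
    T, _, K = feature_loglik.shape
    # Integrate over all features by summing their log-likelihoods per phase.
    emission = feature_loglik.sum(axis=1)  # shape (T, K)
    with np.errstate(divide="ignore"):  # log(0) -> -inf for forbidden transitions
        log_A = np.log(A)

    delta = np.full((T, K), -np.inf)       # best log-score ending in each phase
    backptr = np.zeros((T, K), dtype=int)  # best predecessor phase
    delta[0] = np.log(1.0 / K) + emission[0]  # uniform prior over the starting phase
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (previous phase, current phase)
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + emission[t]

    # Backtrack from the best final phase.
    path = np.zeros(T, dtype=int)
    path[-1] = int(delta[-1].argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path


if __name__ == "__main__":
    # Synthetic example: 8 phases, 20 frames, 30 noisy features favouring the true phase.
    rng = np.random.default_rng(0)
    T, F, K = 20, 30, 8
    true_phase = np.arange(T) % K
    loglik = 0.5 * rng.standard_normal((T, F, K))
    loglik[np.arange(T), :, true_phase] += 1.0
    print(viterbi_gait_phase(loglik, cyclic_transition_matrix(K)))
```

Because the transition matrix only allows a phase to stay or step forward, the decoder cannot jump arbitrarily between phases, which is one plausible way to turn many noisy per-feature estimates into a single smooth phase track.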
To extract 3D pose, 3D models are learnt directly from motion capture data; no image data is used for training. Quantitative results are provided using the HumanEva data set. The approach is tested on a variety of scenes filmed from different viewpoints, with no prior expectation of the path the subject will walk through the scene. Despite this, the same 3D model is used throughout. Unlike many current approaches, the presented method requires no manual initialisation of joint locations in the first frame; this is performed automatically. Furthermore, only a single viewpoint is ever used. The work in this thesis demonstrates that a set of sparse motion features contains enough information to extract the 3D pose of a person performing gaited actions.
Translated title of the contribution: Using Low-Level Motion for High-Level Vision
Publication status: Published, 2009