Abstract
In this thesis, we introduce the problem of action completion, with a focus on assessing whether the goal of an action is achieved, e.g. a ball is successfully caught. We step beyond existing research on action analysis by proposing novel approaches that assess the completion of actions at both frame level and sequence level.
State-of-the-art models, designed primarily for action recognition, may still classify an incomplete sequence as its complete counterpart because of the overlap in evidence. To investigate such goal completion, we introduce completion recognition as a sequence-level classification between complete and incomplete actions. We show that while features can perform comparably for action recognition, they may vary in their ability to recognise completion. We then propose a method which evaluates the performance of different types of features and selects the best for recognising completion. For a thorough and unbiased evaluation, we also introduce the RGB-D Action Completion (RGBD-AC) dataset, which includes a balanced number of complete and incomplete sequences.
We then consider a finer-grained analysis of completion on the temporal dimension. We introduce completion detection as the problem of modelling the action's progression towards localising the moment of completion – when the action's goal is confidently considered achieved. To detect completion, we propose a supervised approach to predict frame-level labels for pre- and post-completion stages. We evaluate the performance of two temporal models, namely a Hidden Markov Model and a Long Short-Term Memory network, along with fine-tuned CNN features. As the presence of complete sequences suffices to detect completion, we extend our evaluation of completion detection to selected actions from two public action recognition datasets, i.e. HMDB and UCF101, in addition to RGBD-AC.
We then propose an approach for sequence-level completion detection using a joint classification-regression recurrent model that predicts completion from a given frame. This model is composed of frame-level recurrent voting nodes that predict the position of the completion moment relative to each frame by either classification or regression. We integrate these frame-level contributions to detect a sequence-level completion moment and show that the highest performance is achieved when contributions from all frames in the sequence, whether before or after completion, are combined.
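As a rough illustration of how frame-level votes might be integrated into a sequence-level detection, the sketch below assumes each frame outputs a signed offset (in frames) to the completion moment and that votes are accumulated in a simple histogram. The function name `detect_completion_moment`, the offset representation and the majority-vote rule are assumptions made for illustration; they are not taken from the thesis, and the recurrent model that would produce the offsets is omitted.

```python
def detect_completion_moment(offsets):
    """Return the frame index receiving the most votes.

    offsets[t] is a hypothetical per-frame prediction: the signed
    distance, in frames, from frame t to the completion moment
    (positive = completion lies ahead, negative = already passed).
    """
    n = len(offsets)
    votes = [0] * n
    for t, off in enumerate(offsets):
        # Each frame votes for the frame it believes is the completion moment.
        target = int(round(t + off))
        # Clamp votes that fall outside the sequence.
        target = min(max(target, 0), n - 1)
        votes[target] += 1
    # Ties are resolved in favour of the earliest frame.
    return max(range(n), key=lambda i: votes[i])


# Frames before and after completion both contribute votes;
# two of the eight predictions here are noisy.
noisy_offsets = [5, 4, 4, 2, 1, 0, -1, -3]
moment = detect_completion_moment(noisy_offsets)
```

In this toy input, six of the eight frames agree on the same frame, so the two noisy predictions do not change the detected moment.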
We finally present an approach for detecting completion with weak supervision. Given sequences with weak video-level complete and incomplete labels, we learn temporal attention, along with completion prediction from all frames in the sequence. The completion moment is detected by accumulating this attention-weighted evidence. We also demonstrate how the approach can be used when completion moment supervision is available and show that temporal attention improves detection in both weakly-supervised and fully-supervised settings.
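The accumulation of attention-weighted evidence can be pictured with a small sketch. It assumes, purely for illustration, that each frame yields a signed offset to the completion moment together with an unnormalised attention score, and that the scores are normalised with a softmax before the votes are summed. The names `softmax` and `attention_weighted_moment` and these modelling choices are hypothetical and are not drawn from the thesis.

```python
import math


def softmax(scores):
    """Normalise raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_moment(offsets, attention_scores):
    """Return the frame index with the most attention-weighted evidence.

    offsets[t]: hypothetical signed distance (in frames) from frame t
                to the completion moment.
    attention_scores[t]: hypothetical unnormalised attention for frame t.
    """
    n = len(offsets)
    weights = softmax(attention_scores)
    evidence = [0.0] * n
    for t, (off, w) in enumerate(zip(offsets, weights)):
        # Clamp each frame's vote to a valid frame index.
        target = min(max(int(round(t + off)), 0), n - 1)
        # A frame's vote counts in proportion to its attention weight.
        evidence[target] += w
    return max(range(n), key=lambda i: evidence[i])


offsets = [3, 2, 1, 0, 2, 2]
uniform = attention_weighted_moment(offsets, [0, 0, 0, 0, 0, 0])
focused = attention_weighted_moment(offsets, [0, 0, 0, 0, 5, 5])
```

With uniform attention the four agreeing frames win, whereas concentrating attention on the last two frames shifts the detected moment to the frame they vote for.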
Date of Award: 23 Jun 2020
Supervisors: Maria Oswald, Dima Damen and Majid Mirmehdi