Abstract
Time plays an integral role in video understanding, from disambiguating gestures like “swipe left” from “swipe right” to being the dimension over which activity unfolds. Leveraging information across time effectively is therefore essential to building successful systems for video understanding. By exploring time from three different perspectives, this thesis sheds light on which parts of a video are informative, how label symmetries can be used to reduce the amount of training data required, and how activities can be explicitly modelled in the representation of a video.

Whilst symmetries in data have been well exploited for data augmentation, the relationships among the labels used to supervise models have received less attention. Both temporal and spatial relationships between fine-grained action labels are explored for the purposes of data augmentation and zero-shot learning.
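To make the idea of label symmetries concrete, the sketch below shows one generic way a symmetry of the input can be paired with a corresponding swap of fine-grained action labels. The label pairs, function name, and array layout are illustrative assumptions, not the exact transforms used in the thesis.

```python
import numpy as np

# Illustrative label pairs related by a symmetry of the video; these pairs
# are assumptions for demonstration, not the thesis's actual label sets.
HORIZONTAL_FLIP_PAIRS = {"swipe left": "swipe right", "swipe right": "swipe left"}
TIME_REVERSAL_PAIRS = {"open drawer": "close drawer", "close drawer": "open drawer"}


def augment_with_label_symmetries(clip: np.ndarray, label: str):
    """Yield (clip, label) pairs for a clip of shape (T, H, W, C).

    A transform is applied only when the label has a known counterpart under
    that transform, so the supervision stays consistent with the pixels.
    """
    yield clip, label  # the original sample
    if label in HORIZONTAL_FLIP_PAIRS:
        # Mirror the width axis and swap to the spatially mirrored label.
        yield clip[:, :, ::-1, :], HORIZONTAL_FLIP_PAIRS[label]
    if label in TIME_REVERSAL_PAIRS:
        # Reverse the frame order and swap to the temporally reversed label.
        yield clip[::-1], TIME_REVERSAL_PAIRS[label]
```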
The dominant approach to applying deep learning to video recognition has been to sample a sequence of frames at regular intervals as input. However, not all frames are created equal: some contain less information than others, or are redundant with frames already sampled. This thesis presents a method for quantifying the importance of frames from the perspective of a trained model. A variety of models have been proposed for video recognition that capture temporal relationships in different ways, but direct comparison has often been difficult owing to differences in evaluation protocol. A benchmark study of these models on the EPIC-KITCHENS dataset under a common evaluation protocol is presented to assess their relative merits.
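As a rough illustration of probing frame importance (a simple occlusion-style baseline, not the attribution method developed in the thesis), one can measure how much a trained model's score for the labelled class drops when each frame is suppressed. The `model_fn` interface and the mean-frame replacement are assumptions made for this example.

```python
import numpy as np

def frame_importance(model_fn, clip: np.ndarray, target_class: int) -> np.ndarray:
    """Score each frame by the drop in the target-class score when that frame
    is replaced by the clip's mean frame.

    model_fn: any callable mapping a (T, H, W, C) clip to a vector of class scores.
    """
    baseline = model_fn(clip)[target_class]
    mean_frame = clip.mean(axis=0)
    scores = np.zeros(len(clip))
    for t in range(len(clip)):
        occluded = clip.copy()
        occluded[t] = mean_frame  # suppress the information in frame t
        scores[t] = baseline - model_fn(occluded)[target_class]
    return scores  # larger values indicate more important frames
```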
One of the challenges faced by the video understanding community is the difficulty of scaling models to longer videos. A novel representation of video is proposed that explicitly models the activities observed up to a given point in time, enabling efficient processing of the video that follows. Additionally, a self-supervised pre-training procedure is introduced to bootstrap the model from long unlabelled videos.
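A generic sketch of such a streaming representation is given below: clip-level features are folded into a fixed-size state one step at a time, so each new segment of video is processed once. The GRU-based update and the feature dimensions are placeholder choices for illustration, not the architecture proposed in the thesis.

```python
import torch
import torch.nn as nn

class RunningActivityState(nn.Module):
    """Maintains a fixed-size state summarising the video observed so far,
    updated in constant time as each new clip's features arrive."""

    def __init__(self, feat_dim: int = 512, state_dim: int = 256):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, state_dim)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (num_clips, feat_dim) features of successive clips.
        state = clip_feats.new_zeros(1, self.cell.hidden_size)
        for feat in clip_feats:
            state = self.cell(feat.unsqueeze(0), state)
        return state  # summary of everything observed up to this point
```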
| Date of Award | 25 Jan 2022 |
| --- | --- |
| Original language | English |
| Awarding Institution | |
| Supervisor | Dima Damen (Supervisor) & Walterio W Mayol-Cuevas (Supervisor) |
Keywords
- Action recognition
- Video understanding
- Computer vision
- Explainability
- Deep learning