Abstract
Temporal action detection plays a crucial role in video understanding, localizing the temporal boundaries of actions and classifying them in long, untrimmed videos. This thesis investigates how the contextual information of timesteps can be used to predict boundaries in both one-stage and two-stage pipelines, and how multiple modalities can be exploited to construct video representations.

Intuitively, timesteps near the boundaries of an action in a video sequence should be better suited to predicting the action's boundary points. Inspired by this, a Temporal Voting Network (TVNet) is proposed for two-stage action detection, which locates temporal boundaries by accumulating contextual evidence to predict frame-level boundary probabilities. The proposed method improves detection performance on the third-person datasets THUMOS14 and ActivityNet-1.3.
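The voting idea above can be illustrated with a minimal sketch. Here each timestep is assumed to predict a signed offset to a nearby boundary together with a confidence weight, and the votes are accumulated into frame-level boundary probabilities. The function name, inputs, and normalisation are illustrative assumptions, not TVNet's exact formulation.

```python
import numpy as np

def accumulate_boundary_votes(offsets, confidences, num_frames):
    """Hypothetical voting scheme: offsets[t] is the predicted signed distance
    (in frames) from timestep t to a boundary, confidences[t] is the weight of
    that vote. Returns a normalised per-frame boundary probability."""
    votes = np.zeros(num_frames)
    for t, (off, conf) in enumerate(zip(offsets, confidences)):
        target = int(round(t + off))
        if 0 <= target < num_frames:  # discard votes falling outside the clip
            votes[target] += conf
    total = votes.sum()
    return votes / total if total > 0 else votes
```

In this sketch, several neighbouring timesteps voting for the same frame reinforce one another, so a boundary supported by consistent contextual evidence receives a sharper probability peak than one predicted by a single timestep.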
Two-stage methods, such as the proposed TVNet, suffer from inflexibility in generating boundaries and cannot be trained end-to-end. To address these issues, one-stage pipelines for action detection have been well explored. However, less attention has been given to estimating the confidence of boundary predictions, which can lead to inaccurate boundaries.
This thesis introduces a method, Refining Action Boundaries (RAB), to incorporate the estimation of boundary confidence into one-stage anchor-free detection, through an additional prediction head that predicts the refined boundaries and more reliable confidence scores.
Based on the proposed one-stage method RAB, this thesis further leverages the natural correlation of visual and auditory information in videos, and explores different strategies to incorporate the audio modality into visual information.
An effective strategy is then introduced, which uses multi-scale cross-attention to fuse the two modalities. This thesis also proposes a novel network head to estimate the closeness of each timestep to the action centre, termed the centricity score. This increases the confidence of proposals that exhibit more precise boundaries. The proposed method can be integrated with other one-stage anchor-free architectures and achieves state-of-the-art performance on the egocentric EPIC-Kitchens-100 dataset, which contains numerous dense actions of varying lengths.
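One plausible form of the centricity score is sketched below, assuming (as in FCOS-style anchor-free detectors) that each timestep inside an action regresses its distances to the start and end boundaries. The exact formulation in the thesis may differ; this function and its ratio are illustrative.

```python
def centricity(t: float, start: float, end: float) -> float:
    """Hypothetical centricity target in [0, 1]: 1 at the action centre,
    decaying to 0 at the boundaries. Timesteps outside [start, end] score 0."""
    if end <= start or not (start <= t <= end):
        return 0.0
    left, right = t - start, end - t  # distances to the two boundaries
    # Square-rooted ratio of the nearer to the farther boundary distance,
    # analogous to the "centre-ness" target used in anchor-free detection.
    return (min(left, right) / max(left, right)) ** 0.5
```

Weighting each proposal's classification confidence by such a score down-ranks proposals generated far from the action centre, which tend to have less precise boundaries.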
| Date of Award | 19 Mar 2024 |
| --- | --- |
| Original language | English |
| Awarding Institution | |
| Supervisor | Majid Mirmehdi (Supervisor), Dima Damen (Supervisor) & Toby J Perrett (Supervisor) |