Temporal Labelling for Action Recognition in Videos

  • Davide Moltisanti

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)

Abstract

Action recognition in computer vision is the task of understanding what a subject is doing in an environment. When performing recognition in videos, labels are typically provided in the form of a category class, along with the temporal boundaries of the action.
Labelling action boundaries entails that an annotator decides when the action starts and ends. This is a subjective and arbitrary task, i.e. different people are likely to identify the start and the end of an action differently. As action boundaries vary, salient and irrelevant video frames are included or excluded, which may affect a classifier's ability to learn and detect actions. This Thesis offers an insight into how action boundaries are perceived and how they can affect classification in videos. An important finding of this study is that accurate temporal labelling is crucial for learning discriminative representations of the actions with current state-of-the-art methods. This Thesis also proposes the Rubicon Boundaries, annotation guidelines inspired by work in cognitive psychology that aim to alleviate labelling ambiguity, in an attempt to foster more precise and consistent annotations.
Action boundaries are not only arbitrary, but also expensive to annotate. This Thesis proposes a novel level of temporal supervision for the task of action recognition, i.e. single timestamps roughly aligned with actions in untrimmed videos. Using this type of supervision, together with the proposed training algorithm, it is possible to achieve performance comparable to results obtained with full temporal supervision. The proposed method can operate under varying dataset complexity, highlighting that single timestamps constitute a good compromise between labelling effort and performance. Single timestamps also alleviate ambiguity, since annotators do not need to decide when the action starts and ends, but only to mark one frame within or close to the action.
Date of Award: 28 Nov 2019
Original language: English
Awarding Institution
  • The University of Bristol
Supervisors: Dima Damen & Walterio W Mayol-Cuevas