Egocentric Audio-Visual Understanding using Approaches for Self-Supervision and Counting

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)

Abstract

This thesis explores three methods for understanding video, with a focus on egocentric data: utilising the audio stream alone to perform action classification; incorporating contrastive learning into supervised training to improve classification performance; and repetition counting, for which we exploit the self-supervisory nature of repetitive videos in both the visual and audio modalities.

Audio is an inherent part of the video stream, as well as an important element in human understanding. We begin by exploring the utility of egocentric audio alone, demonstrating that audio, without any visual signal, can still be effective for predicting actions in the egocentric setting. We start with an off-the-shelf ResNet50, extending the network to include audio-specific modifications, and explore their impact. Our work is based on the EPIC-KITCHENS-100 dataset, using audio to predict verbs and nouns.
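As an illustration only, the following is a minimal PyTorch-style sketch of what an audio-only classifier of this kind might look like, assuming log-mel spectrograms are treated as single-channel images, fed to an off-the-shelf ResNet50 with an adapted input stem, and classified by separate verb and noun heads (97 verbs and 300 nouns, as in EPIC-KITCHENS-100). The audio-specific modifications actually studied in the thesis may differ; the class name and defaults here are hypothetical.

```python
# Hypothetical sketch: log-mel spectrogram -> ResNet50 -> verb/noun logits.
import torch
import torch.nn as nn
import torchvision.models as models


class AudioResNet50(nn.Module):
    def __init__(self, num_verbs: int = 97, num_nouns: int = 300):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Replace the RGB stem with a single-channel stem for spectrograms.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()          # keep the pooled 2048-d feature
        self.backbone = backbone
        self.verb_head = nn.Linear(feat_dim, num_verbs)
        self.noun_head = nn.Linear(feat_dim, num_nouns)

    def forward(self, spectrogram: torch.Tensor):
        # spectrogram: (batch, 1, mel_bins, time_frames)
        feats = self.backbone(spectrogram)
        return self.verb_head(feats), self.noun_head(feats)


# Example: a batch of 4 spectrograms with 128 mel bins and 400 time frames.
verb_logits, noun_logits = AudioResNet50()(torch.randn(4, 1, 128, 400))
```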

We then study the visual modality alone, utilising the untrimmed nature of many egocentric video datasets. Untrimmed videos can be very long, consisting of many shorter action clips. We exploit the presence of neighbouring clips when classifying an individual action, recognising that, in EPIC-KITCHENS-100, neighbouring clips are easily mistaken for the clip being classified. We use a contrastive learning approach in combination with supervised training to separate the representation of each clip from those of its neighbours, improving action recognition performance.
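To make the idea concrete, the sketch below combines a standard cross-entropy loss with a generic InfoNCE-style contrastive term that pulls two views of the same clip together while pushing the clip away from embeddings of its temporal neighbours. This is an assumption about the general shape of such an objective, not the thesis's exact loss; the function name, temperature and weighting are illustrative.

```python
# Hedged sketch: supervised cross-entropy plus an InfoNCE-style term that
# separates a clip's embedding from those of its neighbouring clips.
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(logits, labels, anchor, positive, neighbours,
                                temperature=0.07, weight=0.5):
    """logits: (B, C) class scores; labels: (B,) action labels;
    anchor, positive: (B, D) embeddings of two views of the same clip;
    neighbours: (B, K, D) embeddings of K neighbouring clips from the same
    untrimmed video."""
    ce = F.cross_entropy(logits, labels)

    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    neighbours = F.normalize(neighbours, dim=-1)

    pos = (anchor * positive).sum(dim=-1, keepdim=True) / temperature    # (B, 1)
    neg = torch.einsum('bd,bkd->bk', anchor, neighbours) / temperature   # (B, K)
    contrastive_logits = torch.cat([pos, neg], dim=1)                    # (B, 1+K)
    # The positive sits at index 0 of each row, so the target class is 0.
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    nce = F.cross_entropy(contrastive_logits, targets)

    return ce + weight * nce
```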

Finally, we combine the self-supervision and audio-visual aspects of our work, applying them to the task of repetition counting. We develop a pipeline for automatically counting periods of repetition within video in a self-supervised manner, without using any ground truth. We perform an in-depth analysis to demonstrate the effectiveness of our method, showing that we can accurately count repetitions in video and also gauge how reliable each estimated count is. We demonstrate this through analysis of our estimated labels and through training models using those labels. Both video and audio can be highly repetitive; we show how our method extends across modalities, identifying where it works well and where it may fall short. Whilst this final work is not egocentric, it serves as a proof of concept for future application to egocentric data.
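As a hypothetical illustration of counting repetition without ground truth, the sketch below estimates a dominant period from the self-similarity of a feature sequence (visual or audio embeddings alike) and converts it to a count. The actual pipeline in the thesis is more involved; this only shows that a period, and hence a count, can be recovered from repetitive structure alone.

```python
# Minimal sketch: estimate a repetition count from the self-similarity of a
# (T, D) sequence of per-frame or per-audio-window embeddings.
import numpy as np


def estimate_count(features: np.ndarray, min_period: int = 2) -> float:
    feats = features - features.mean(axis=0, keepdims=True)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = feats @ feats.T                                  # (T, T) cosine self-similarity
    T = sim.shape[0]

    # Mean similarity at each temporal lag k (average of the k-th diagonal).
    lag_sim = np.array([np.diag(sim, k).mean() for k in range(T)])

    # Take the first clear local maximum of the lag profile as the period;
    # later peaks sit at integer multiples of the true period.
    threshold = 0.5 * lag_sim[min_period:T // 2].max()
    for k in range(min_period, T // 2):
        if (lag_sim[k] > threshold
                and lag_sim[k] >= lag_sim[k - 1]
                and lag_sim[k] >= lag_sim[k + 1]):
            return T / k
    return 1.0  # no repetitive structure detected


# Example: a noisy sinusoidal feature sequence repeating every 20 frames.
t = np.arange(200)
feats = np.stack([np.sin(2 * np.pi * t / 20), np.cos(2 * np.pi * t / 20)], axis=1)
print(estimate_count(feats + 0.05 * np.random.randn(*feats.shape)))  # approx. 10
```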
Date of Award: 9 Dec 2025
Original language: English
Awarding Institution:
  • University of Bristol
Supervisors: Tilo Burghardt & Dima Damen
