Audio-Visual Egocentric Action Recognition

  • Evangelos Kazakos

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)


Egocentric actions generate distinctive and varied sounds from the interactions between hands and objects. Yet, egocentric vision methods have dismissed the auditory signal, focusing on understanding object manipulations through visual reasoning alone. This thesis enhances egocentric action understanding with audio recognition capabilities, capitalising on the close proximity of the wearable sensor to the ongoing action, which enables the capture of crisp audio recordings.

This thesis leverages the natural synergy of vision and audio and focuses on audio-visual integration for egocentric action recognition. As actions progress at different speeds in each modality, traditional synchronous fusion approaches cannot associate the misaligned discriminative moments of each modality. To unlock this potential, this thesis proposes an asynchronous fusion approach that randomly binds appearance, motion and auditory inputs within temporal windows.
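The asynchronous binding described above can be sketched as follows. This is a minimal toy illustration, not the thesis implementation: each modality independently samples its own timestep inside a shared temporal window, so the bound inputs need not be frame-synchronised. The function name and the fixed modality set are assumptions for illustration.

```python
import random

def sample_binding_window(num_timesteps, window_size, rng=random):
    """Hypothetical helper: pick one temporal index per modality,
    independently, within a shared temporal window."""
    # Choose where the window starts inside the clip.
    start = rng.randrange(0, num_timesteps - window_size + 1)
    # Each modality samples its own offset inside the window, so
    # discriminative moments from different modalities can be bound
    # together even when they are temporally misaligned.
    return {
        modality: start + rng.randrange(window_size)
        for modality in ("appearance", "motion", "audio")
    }
```

Because all sampled indices fall inside one window, the fused inputs still describe the same action even though they are not synchronised.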

The next endeavour of this thesis is to improve the audio understanding capacity of action recognition models deprived of visual input. Inspired by the two-stream hypothesis of the human auditory system, this thesis introduces a novel two-stream auditory architecture in which a slow stream focuses on harmonic sounds and a fast stream captures percussive sounds. This thesis also investigates four-stream architectures that fuse slow and fast visual and auditory streams, and showcases the vital importance of audio-visual regularisation when training such architectures.
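The slow/fast split can be illustrated with a toy sketch, assuming a spectrogram input of shape (time, frequency); this is not the thesis architecture, only an intuition for why the two streams operate at different temporal resolutions.

```python
import numpy as np

def two_stream_split(spectrogram, slow_stride=4):
    """Toy illustration (assumed shapes, hypothetical helper): the slow
    stream subsamples time to favour slowly varying harmonic content,
    while the fast stream keeps full temporal resolution to capture
    transient, percussive content. `spectrogram` is (time, freq)."""
    slow = spectrogram[::slow_stride]  # coarse in time, for harmonics
    fast = spectrogram                 # full resolution, for percussion
    return slow, fast
```

In a real architecture each stream would feed its own convolutional pathway, with the fast stream typically given fewer channels to compensate for its higher temporal resolution.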

Finally, this thesis brings a new perspective to action recognition from untrimmed videos by showing that the current paradigm of treating each action in isolation is inefficient. Using the key insight that untrimmed videos offer well-defined sequences of actions, this thesis proposes to strengthen action understanding by exploiting the temporal progression of the activity within each action's temporal context. To this end, it introduces the notion of multimodal temporal context and proposes a model that captures the inductive biases of untrimmed videos using vision, audio and language.
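The idea of temporal context can be sketched minimally as follows; this is a hypothetical helper, not the proposed model, showing only how an action's neighbours in an untrimmed video can be gathered so a classifier can exploit the activity's temporal progression.

```python
def with_temporal_context(features, centre, k=2):
    """Hypothetical sketch: return the feature vectors of up to k
    actions before and after the centre action in an untrimmed video,
    clipped at the video boundaries."""
    lo = max(0, centre - k)
    hi = min(len(features), centre + k + 1)
    return features[lo:hi]
```

A context-aware model would then classify the centre action from this window rather than from the centre clip alone, and the same windowing applies per modality (vision, audio, language).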
Date of Award: 21 Jun 2022
Original language: English
Awarding Institution:
  • University of Bristol
Supervisor: Dima Damen (Supervisor)