
Leveraging Multimodal Data for Egocentric Video Understanding

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)

Abstract

Egocentric perception seeks to interpret the world from a first-person perspective, yet much of the research to date has focused narrowly on 2D visual data. This overlooks the rich, multimodal signals humans naturally integrate to perceive and act in the world. This thesis argues for a more holistic paradigm, asserting that a true understanding of egocentric activity requires the fusion of diverse sensory cues that humans process effortlessly.

We begin with the observation that audio and visual events often do not perfectly align in either time or semantics, yet many audio-visual methods assume such alignment, usually with a bias towards the visual modality. To address this, we first introduce a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos, accompanying the existing visual annotations and allowing each modality to be accurately represented. Building on this, we propose an audio-visual model that captures the interplay between modalities in long videos by modelling the temporal extents of audio and visual events. The model uses modality-specific time intervals as queries to a transformer encoder, which processes an untrimmed video to recognise actions within the queried intervals and modalities, leveraging both the intervals and the surrounding audio-visual context. This approach extends to action detection via dense, multi-scale interval queries, enabling fine-grained localisation of events in untrimmed footage. Notably, the model dynamically refocuses on different segments of the same input based on the time interval query, allowing it to learn from events that are temporally or semantically misaligned across modalities.

To further understand human intent, this thesis investigates the predictive power of gaze priming, in which eye movements foreshadow the future pick-up and placement locations of objects in 3D space. By analysing when gaze primes object interaction, we evaluate the predictive reasoning capabilities of vision-language models. Our findings reveal a significant gap: these models struggle to identify the true focus of visual attention or link anticipatory gaze to upcoming interactions, underscoring the need for systems that can proactively model human behaviour.

Finally, this thesis addresses the challenge of maintaining spatial coherence in egocentric environments by developing a model for long-term 3D multi-object tracking. Leveraging both object appearance and 3D positional data, our approach maintains persistent object identities in dynamic scenes. By synthesising diverse modalities (sound, vision, gaze, and spatial reasoning), this work presents a more predictive and human-like model of egocentric perception, moving beyond the 2D frame towards richer, embodied understanding.
Date of Award: 20 Jan 2026
Original language: English
Awarding Institution
  • University of Bristol
