With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

Vangelis Kazakos*, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen

*Corresponding author for this work

Research output: Contribution to conference › Conference Paper › peer-review



In egocentric videos, actions occur in quick succession. We capitalise on an action’s temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating the audio modality and a language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
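The idea of rescoring with an action-sequence language model can be illustrated with a minimal sketch: combine the audio-visual model's class posteriors with language-model scores conditioned on surrounding actions. The log-linear interpolation, the weight `alpha`, and all names below are illustrative assumptions, not the authors' exact formulation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rescore(av_logits, lm_log_probs, alpha=0.5):
    """Log-linearly interpolate audio-visual posteriors with
    language-model log-probabilities over the candidate classes.
    `alpha` (an assumed hyperparameter) balances the two sources."""
    log_post = [math.log(p) for p in softmax(av_logits)]
    return [(1 - alpha) * lp + alpha * lm
            for lp, lm in zip(log_post, lm_log_probs)]

# Toy example: the audio-visual model slightly prefers class 0, but the
# language model, given the surrounding actions, strongly favours class 1.
av_logits = [2.0, 1.8, 0.1]
lm_log_probs = [math.log(0.1), math.log(0.8), math.log(0.1)]
scores = rescore(av_logits, lm_log_probs)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy setting the rescored prediction flips to the class the sequence context supports, which is the effect the paper's ablations attribute to the language model.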
Original language: English
Number of pages: 24
Publication status: Unpublished - 25 Nov 2021
Event: The 32nd British Machine Vision Conference - Online
Duration: 22 Nov 2021 - 25 Nov 2021
Conference number: 32


Conference: The 32nd British Machine Vision Conference
Abbreviated title: BMVC 2021


