With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

Vangelis Kazakos*, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen

*Corresponding author for this work

Research output: Contribution to conference › Conference Paper › peer-review


Abstract

In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating the audio input modality and a language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
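
To illustrate the temporal-context idea described in the abstract, the following is a minimal sketch in PyTorch, not the released MTCN code: a transformer encoder ingests video and audio features for a window of neighbouring actions and classifies the centre action. All module names, feature dimensions, class counts and the window size are illustrative assumptions, and the language-model rescoring step is omitted.

# Minimal sketch (not the authors' MTCN implementation) of a transformer that
# attends over the temporal context of surrounding actions using both video
# and audio features. Dimensions, class count and window size are assumptions.
import torch
import torch.nn as nn

class TemporalContextTransformer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, num_classes=97,
                 context_len=9, nhead=8, num_layers=4):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, d_model)
        self.audio_proj = nn.Linear(feat_dim, d_model)
        # One positional embedding per action in the window, shared across
        # modalities, plus a learned embedding distinguishing video from audio.
        self.pos_embed = nn.Parameter(torch.zeros(context_len, d_model))
        self.modality_embed = nn.Parameter(torch.zeros(2, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)
        self.context_len = context_len

    def forward(self, video_feats, audio_feats):
        # video_feats, audio_feats: (batch, context_len, feat_dim)
        v = self.video_proj(video_feats) + self.pos_embed + self.modality_embed[0]
        a = self.audio_proj(audio_feats) + self.pos_embed + self.modality_embed[1]
        tokens = torch.cat([v, a], dim=1)      # (batch, 2 * context_len, d_model)
        tokens = self.encoder(tokens)          # attend across actions and modalities
        centre = self.context_len // 2
        # Predict the centre action of the window from its video token.
        return self.classifier(tokens[:, centre])

if __name__ == "__main__":
    model = TemporalContextTransformer()
    video = torch.randn(2, 9, 1024)   # features of 9 consecutive actions
    audio = torch.randn(2, 9, 1024)
    logits = model(video, audio)
    print(logits.shape)                # torch.Size([2, 97])

In this sketch, attention over the concatenated video and audio tokens of neighbouring actions is what lets the prediction for the centre action draw on its temporal context.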
Original language: English
Number of pages: 24
Publication status: Unpublished - 25 Nov 2021
Event: The 32nd British Machine Vision Conference - Online
Duration: 22 Nov 2021 - 25 Nov 2021
Conference number: 32
https://www.bmvc2021-virtualconference.com/
https://www.bmvc2021.com/

Conference

Conference: The 32nd British Machine Vision Conference
Abbreviated title: BMVC 2021
Period: 22/11/21 - 25/11/21
