Abstract
In egocentric videos, actions occur in quick succession. We capitalise on the action’s temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating the audio modality and the language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
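Purely as an illustration of the architecture the abstract describes, the sketch below is a minimal, hypothetical version (not the authors' released MTCN code; the class name, feature dimensions, window size, class count, and the `rescore` helper are all assumptions): a transformer encoder attends over video and audio tokens from a window of neighbouring actions to classify the centre action, and a simple late-fusion step combines visual scores with language-model scores over the action sequence.

```python
import torch
import torch.nn as nn


class MTCNSketch(nn.Module):
    """Toy multimodal temporal-context transformer (illustrative only)."""

    def __init__(self, feat_dim=1024, d_model=512, num_classes=97,
                 num_heads=8, num_layers=4, window=9):
        super().__init__()
        self.window = window
        self.video_proj = nn.Linear(feat_dim, d_model)
        self.audio_proj = nn.Linear(feat_dim, d_model)
        # One learned positional embedding per token: a video token and an
        # audio token for each action in the temporal window.
        self.pos_emb = nn.Parameter(torch.zeros(2 * window, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, video_feats, audio_feats):
        # video_feats, audio_feats: (batch, window, feat_dim) pre-extracted
        # features for the centre action and its temporal neighbours.
        tokens = torch.cat(
            [self.video_proj(video_feats), self.audio_proj(audio_feats)], dim=1)
        tokens = tokens + self.pos_emb          # broadcast over the batch
        encoded = self.encoder(tokens)          # self-attention across actions
        centre = encoded[:, self.window // 2]   # video token of the centre action
        return self.classifier(centre)          # class scores for the centre action


def rescore(visual_logprobs, lm_logprobs, lam=0.5):
    """Late fusion: combine visual scores of a candidate action sequence with
    language-model scores of the same sequence (weight lam is an assumption)."""
    return visual_logprobs + lam * lm_logprobs


# Usage: two clips, a window of 9 actions, 1024-dim features per modality.
model = MTCNSketch()
scores = model(torch.randn(2, 9, 1024), torch.randn(2, 9, 1024))  # (2, 97)
```

The released code at https://github.com/ekazakos/MTCN remains the authoritative reference for the actual model and rescoring procedure.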
Original language | English |
---|---|
Number of pages | 24 |
Publication status | Unpublished - 25 Nov 2021 |
Event | The 32nd British Machine Vision Conference, Online, 22 Nov 2021 → 25 Nov 2021 |

Conference

Conference | The 32nd British Machine Vision Conference |
---|---|
Abbreviated title | BMVC 2021 |
Conference number | 32 |
Period | 22/11/21 → 25/11/21 |
Internet address | https://www.bmvc2021-virtualconference.com/ https://www.bmvc2021.com/ |