Abstract
In egocentric videos, actions occur in quick succession. We capitalise on an action’s temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating audio as an input modality and a language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
| Original language | English |
|---|---|
| Number of pages | 24 |
| Publication status | Unpublished - 25 Nov 2021 |
| Event | The 32nd British Machine Vision Conference (Online) |
| Duration | 22 Nov 2021 → 25 Nov 2021 |
| Conference number | 32 |
| Internet addresses | https://www.bmvc2021-virtualconference.com/ https://www.bmvc2021.com/ |
Conference
| Conference | The 32nd British Machine Vision Conference |
|---|---|
| Abbreviated title | BMVC 2021 |
| Period | 22/11/21 → 25/11/21 |
| Internet address | https://www.bmvc2021-virtualconference.com/ |