ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
Abstract
In this paper, we re-examine the task of cross-modal clip-sentence retrieval,
where the clip is part of a longer untrimmed video. When the clip is short
or visually ambiguous, knowledge of its local temporal context (i.e. surrounding
video segments) can be used to improve the retrieval performance. We propose
Context Transformer (ConTra): an encoder architecture that models the interaction
between a video clip and its local temporal context in order to enhance
its embedded representations. Importantly, we supervise the context transformer
using contrastive losses in the cross-modal embedding space.
We explore context transformers for video and text modalities. Results consistently
demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS,
and a clip-sentence version of ActivityNet Captions. Exhaustive ablation
studies and context analysis show the efficacy of the proposed method.
Original language | English
---|---
Publication status | Published - 8 Dec 2022
Event | Asian Conference on Computer Vision, Hanoi, Viet Nam, 4 Dec 2022 → 8 Dec 2022, https://accv2024.org/
Conference
Conference | Asian Conference on Computer Vision
---|---
Abbreviated title | ACCV
Country/Territory | Viet Nam
City | Hanoi
Period | 4/12/22 → 8/12/22
Internet address | https://accv2024.org/
Projects
-
Visual AI (8030, EPSRC via Oxford, EP/T028572/1)
Damen, D. (Principal Investigator)
1/12/20 → 30/11/25
Project: Research
-
UMPIRE: United Model for the Perception of Interactions for visual Recognition
Damen, D. (Principal Investigator)
1/02/20 → 31/01/25
Project: Research
Datasets
-
EPIC-KITCHENS-100
Aldamen, D. (Creator), Kazakos, E. (Creator), Doughty, H. (Creator), Munro, J. (Creator), Price, W. (Creator), Wray, M. (Creator), Perrett, T. (Creator) & Ma, J. (Creator), University of Bristol, 15 May 2020
DOI: 10.5523/bris.2g1n6qdydwa9u22shpxqzp0t8m, http://data.bris.ac.uk/data/dataset/2g1n6qdydwa9u22shpxqzp0t8m
Dataset
Equipment
-
HPC (High Performance Computing) and HTC (High Throughput Computing) Facilities
Alam, S. R. (Manager), Williams, D. A. G. (Manager), Eccleston, P. E. (Manager) & Greene, D. (Manager)
Facility/equipment: Facility