ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

Adriano Fragomeni*, Michael Wray, Dima Damen

*Corresponding author for this work

Research output: Contribution to conference; Conference Paper; peer-review



In this paper, we re-examine the task of cross-modal clip-sentence retrieval,
where the clip is part of a longer untrimmed video. When the clip is short
or visually ambiguous, knowledge of its local temporal context (i.e. surrounding
video segments) can be used to improve the retrieval performance. We propose
Context Transformer (ConTra), an encoder architecture that models the interaction
between a video clip and its local temporal context in order to enhance
its embedded representations. Importantly, we supervise the context transformer
using contrastive losses in the cross-modal embedding space.
We explore context transformers for video and text modalities. Results consistently
demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS
and a clip-sentence version of ActivityNet Captions. Exhaustive ablation
studies and context analysis show the efficacy of the proposed method.
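
The two ideas in the abstract above, gathering a clip's local temporal context and training with a contrastive loss in the shared embedding space, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the boundary clamping, the window half-width m, and the choice of a symmetric InfoNCE-style loss with temperature 0.07 are all assumptions made for illustration, and the transformer encoder itself is omitted.

```python
import math

def context_window(index, num_clips, m):
    """Return the indices of clip `index` and its m neighbours on each side.

    Assumption: indices falling outside the video are clamped to the
    first/last clip, so the window always has 2*m + 1 entries.
    """
    return [min(max(i, 0), num_clips - 1) for i in range(index - m, index + m + 1)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def _diagonal_cross_entropy(rows):
    """Mean negative log-softmax of the diagonal entry of each row."""
    total = 0.0
    for i, row in enumerate(rows):
        mx = max(row)  # subtract the max for numerical stability
        log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += row[i] - log_sum_exp
    return -total / len(rows)

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE-style loss (an illustrative stand-in).

    Row i of `video_embs` and row i of `text_embs` are a matched pair;
    every other row in the batch acts as a negative.
    """
    sims = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]
    video_to_text = _diagonal_cross_entropy(sims)
    text_to_video = _diagonal_cross_entropy([list(col) for col in zip(*sims)])
    return 0.5 * (video_to_text + text_to_video)
```

In this sketch, the features of the clips selected by context_window would be fed to an encoder that outputs one context-enriched embedding for the centre clip, and that embedding would then be compared against sentence embeddings by the loss.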
Original language: English
Publication status: Published - 8 Dec 2022
Event: Asian Conference on Computer Vision
Duration: 4 Dec 2022 to 8 Dec 2022

Conference: Asian Conference on Computer Vision
Abbreviated title: ACCV

