Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

Research output: Contribution to conference, Conference Paper


We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved by learning a shared embedding space into which either modality can be embedded. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of the multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
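The structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The choice of verb and noun as PoS tags, the linear projections, the margin-based loss, and every name below (`PoSSpace`, `IntegratedSpace`, `joint_loss`) are assumptions made for illustration, since the abstract does not specify them. Random weights stand in for learned parameters, and only the video-to-text direction of the loss is shown.

```python
import math
import random

POS_TAGS = ("verb", "noun")  # assumption: which PoS tags get their own space


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned linear projection."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, x):
    """Apply a linear projection, then L2-normalise the result."""
    y = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    norm = math.sqrt(sum(a * a for a in y)) or 1.0
    return [a / norm for a in y]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive should outscore the negative by `margin`."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))


class PoSSpace:
    """One multi-modal embedding space dedicated to a single PoS tag."""

    def __init__(self, video_dim, text_dim, embed_dim, rng):
        self.video_proj = make_linear(video_dim, embed_dim, rng)
        self.text_proj = make_linear(text_dim, embed_dim, rng)

    def embed_video(self, video):
        return project(self.video_proj, video)

    def embed_text(self, words):
        return project(self.text_proj, words)


class IntegratedSpace:
    """Final space fed by the concatenated outputs of every PoS space."""

    def __init__(self, video_dim, text_dim, embed_dim, out_dim, rng):
        self.pos = {t: PoSSpace(video_dim, text_dim, embed_dim, rng) for t in POS_TAGS}
        joint_in = embed_dim * len(POS_TAGS)
        self.video_proj = make_linear(joint_in, out_dim, rng)
        self.text_proj = make_linear(joint_in, out_dim, rng)

    def embed_video(self, video):
        parts = [x for t in POS_TAGS for x in self.pos[t].embed_video(video)]
        return project(self.video_proj, parts)

    def embed_text(self, caption):
        # `caption` maps each PoS tag to a feature vector for the words with that tag.
        parts = [x for t in POS_TAGS for x in self.pos[t].embed_text(caption[t])]
        return project(self.text_proj, parts)


def joint_loss(model, video, caption, negative_caption):
    """Per-PoS (PoS-aware) losses plus one PoS-agnostic loss in the final space."""
    pos_aware = sum(
        margin_loss(
            model.pos[t].embed_video(video),
            model.pos[t].embed_text(caption[t]),
            model.pos[t].embed_text(negative_caption[t]),
        )
        for t in POS_TAGS
    )
    pos_agnostic = margin_loss(
        model.embed_video(video),
        model.embed_text(caption),
        model.embed_text(negative_caption),
    )
    return pos_aware + pos_agnostic
```

In this sketch a caption is represented as a dictionary from PoS tag to a feature vector, so a call such as `joint_loss(model, video, caption, negative_caption)` returns a single non-negative scalar. Summing the two kinds of loss is one simple way to train all spaces jointly; the weighting between them is left out here because the abstract does not state it.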

We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.
Original language: English
Number of pages: 10
Publication status: Published - 2 Nov 2019
Event: IEEE/CVF International Conference on Computer Vision (ICCV) 2019 - Seoul, Korea
Duration: 27 Oct 2019 to 2 Nov 2019


Conference: IEEE/CVF International Conference on Computer Vision (ICCV) 2019


Student Theses

Verbs and Me: An Investigation Into Verbs as Labels for Action Recognition in Video Understanding

Author: Wray, M., 23 Jan 2020

Supervisor: Damen, D. (Supervisor)

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)


Cite this

Wray, M., Larlus, D., Csurka, G., & Damen, D. (2019). Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. Paper presented at IEEE/CVF International Conference on Computer Vision (ICCV) 2019, Seoul, Korea.