Abstract
We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved by learning a shared embedding space into which either modality can be embedded. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of the multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.
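The architecture described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the choice of `verb` and `noun` as PoS tags, the feature sizes, the random linear projections, the single-direction triplet loss, and all names (`PoSEmbedder`, `joint_loss`, and so on) are hypothetical. It only shows the data flow: one embedding per PoS tag, concatenation into an integrated space, and a loss that sums PoS-aware terms with a PoS-agnostic term.

```python
import math
import random


def linear(weights, vec):
    """Apply a dense projection; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def cosine(a, b):
    return sum(x * y for x, y in zip(l2_normalise(a), l2_normalise(b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive should be closer to the anchor than the negative."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


class PoSEmbedder:
    """One modality branch: a projection per PoS tag, then a fused projection."""

    def __init__(self, rng, in_dims, pos_dim, out_dim):
        # in_dims maps PoS tag -> input feature size (hypothetical sizes).
        self.tags = sorted(in_dims)
        self.pos_proj = {t: random_matrix(rng, pos_dim, in_dims[t]) for t in self.tags}
        self.fuse = random_matrix(rng, out_dim, pos_dim * len(self.tags))

    def embed(self, inputs):
        # Per-PoS embeddings, each living in its own space.
        per_pos = {t: l2_normalise(linear(self.pos_proj[t], inputs[t])) for t in self.tags}
        # The integrated space takes the concatenated PoS embeddings as input.
        concat = [x for t in self.tags for x in per_pos[t]]
        return per_pos, l2_normalise(linear(self.fuse, concat))


def joint_loss(video, text_pos, text_neg, margin=0.2):
    """Each argument is a (per_pos, fused) pair returned by PoSEmbedder.embed."""
    pos_aware = sum(
        triplet_loss(video[0][t], text_pos[0][t], text_neg[0][t], margin)
        for t in video[0]
    )
    pos_agnostic = triplet_loss(video[1], text_pos[1], text_neg[1], margin)
    return pos_aware + pos_agnostic


rng = random.Random(0)
video_net = PoSEmbedder(rng, {"verb": 8, "noun": 8}, pos_dim=4, out_dim=6)
text_net = PoSEmbedder(rng, {"verb": 5, "noun": 5}, pos_dim=4, out_dim=6)

# Assumption: the same clip feature feeds every PoS branch on the video side,
# while the caption is split into one feature vector per PoS tag.
clip = [rng.random() for _ in range(8)]
caption_pos = {t: [rng.random() for _ in range(5)] for t in ("verb", "noun")}
caption_neg = {t: [rng.random() for _ in range(5)] for t in ("verb", "noun")}

v = video_net.embed({"verb": clip, "noun": clip})
t_pos = text_net.embed(caption_pos)
t_neg = text_net.embed(caption_neg)
loss = joint_loss(v, t_pos, t_neg)
```

In this sketch, retrieval would rank candidates by cosine similarity between the fused embeddings (`v[1]` against each caption's fused vector); the per-PoS embeddings only contribute to the training signal.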
Original language | English |
---|---|
Title of host publication | 2019 IEEE/CVF International Conference on Computer Vision (ICCV) |
Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
Pages | 450-459 |
Number of pages | 10 |
ISBN (Electronic) | 978-1-7281-4803-8 |
Publication status | Published - 2 Nov 2019 |
Event | IEEE/CVF International Conference on Computer Vision (ICCV) 2019 - Seoul, Korea. Duration: 27 Oct 2019 → 2 Nov 2019 |
Publication series
ISSN (Electronic) | 2380-7504 |
---|---|
Conference
Conference | IEEE/CVF International Conference on Computer Vision (ICCV) 2019 |
---|---|
City | Seoul |
Period | 27/10/19 → 2/11/19 |
Projects
- LOCATE: LOcation adaptive Constrained Activity recognition using Transfer learning
  4/07/16 → 3/05/18
  Project: Research
  Status: Finished
Student Theses
- Verbs and Me: An Investigation Into Verbs as Labels for Action Recognition in Video Understanding
  Author: Wray, M., 23 Jan 2020
  Supervisor: Damen, D.
  Student thesis: Doctoral Thesis › Doctor of Philosophy (PhD)
Equipment
- HPC (High Performance Computing) Facility
  Polly E Eccleston (Other), Simon H Atack (Other) & D A G Williams (Manager)
  Facility/equipment: Facility
Profiles
- Professor Dima Damen
  - Department of Computer Science - Professor in Computer Vision
  - Bristol Vision Institute
  - Visual Information Laboratory
  Person: Academic, Member