Beyond Caption-Based Queries for Video Moment Retrieval

David Pujol Perich, Albert Clapés, Dima Damen, Sergio Escalera, Michael Wray

Research output: Chapter in Book/Report/Conference proceeding > Conference Contribution (Conference Proceeding)

Abstract

In this work, we investigate the degradation of existing video moment retrieval (VMR) methods, particularly DETR architectures, when they are trained on caption-based queries but evaluated on search queries. To this end, we introduce three benchmarks by modifying the textual queries in three public VMR datasets: HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures, an active decoder-query collapse, as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and by up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available on the project webpage: this https URL
Original language: English
Title of host publication: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication status: Accepted/In press - 21 Feb 2026
Event: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 - Colorado Convention Center, Denver, United States
Duration: 3 Jun 2026 to 7 Jun 2026
https://cvpr.thecvf.com/

Publication series

Name: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE
ISSN (Electronic): 2575-7075

Conference

Conference: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
Abbreviated title: CVPR
Country/Territory: United States
City: Denver
Period: 3/06/26 to 7/06/26

Research Groups and Themes

  • Intelligent Systems Laboratory (MaVi)
