Abstract
In this work, we investigate the degradation of existing video moment retrieval (VMR) methods, particularly those with DETR architectures, when they are trained on caption-based queries but evaluated on search queries. To this end, we introduce three benchmarks by modifying the textual queries of three public VMR datasets: HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures, an active decoder-query collapse, as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and by up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available on the project webpage.
| Original language | English |
|---|---|
| Title of host publication | 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
| Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
| Publication status | Accepted/In press - 21 Feb 2026 |
| Event | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 - Colorado Convention Center, Denver, United States. Duration: 3 Jun 2026 → 7 Jun 2026. https://cvpr.thecvf.com/ |
Publication series
| Name | IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
|---|---|
| Publisher | IEEE |
| ISSN (Electronic) | 2575-7075 |
Conference
| Conference | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 |
|---|---|
| Abbreviated title | CVPR |
| Country/Territory | United States |
| City | Denver |
| Period | 3/06/26 → 7/06/26 |
| Internet address | https://cvpr.thecvf.com/ |
Research Groups and Themes
- Intelligent Systems Laboratory (MaVi)
Fingerprint
Dive into the research topics of 'Beyond Caption-Based Queries for Video Moment Retrieval'. Together they form a unique fingerprint.
Datasets
- HD-EPIC
Cramp, L. (Creator), Wray, M. (Creator), Perrett, T. (Creator), Chalk, J. (Creator), Flanagan, K. (Creator), Khalil, A. D. (Creator), Sinha, S. (Creator), Emara, O. (Creator), Zhu, Z. (Creator), Bansal, S. (Creator), Parida, K. (Creator), Gatti, P. (Creator), Guerrier, R. (Creator), Pollard, S. (Creator) & Abdelazim, F. (Creator), University of Bristol, 1 Jul 2014
DOI: 10.5523/bris.3cqb5b81wk2dc2379fx1mrxh47, http://data.bris.ac.uk/data/dataset/3cqb5b81wk2dc2379fx1mrxh47
Dataset