HD-EPIC: A Highly-Detailed Egocentric Video Dataset

Toby Perrett, Ahmad A K Dar Khalil, Saptarshi Sinha, Omar Emara, Sam J Pollard, Kranti K Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob I Chalk, Zhifan Zhu, Rhodri G L Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)

Abstract

We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, and object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments.

We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro achieves only 37.6% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC.

HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.
Original language: English
Title of host publication: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 29
Publication status: Accepted/In press - 1 Mar 2025
Event: IEEE/CVF Computer Vision and Pattern Recognition (CVPR) - Nashville, United States
Duration: 11 Jun 2025 - 15 Jun 2025
https://cvpr.thecvf.com

Publication series

Name: Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: IEEE/CVF Computer Vision and Pattern Recognition
Country/Territory: United States
City: Nashville
Period: 11/06/25 - 15/06/25
Internet address: https://cvpr.thecvf.com
