Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman*, Siddhant Bansal, Zhifan Zhu, Dima Damen, Michael Wray

*Corresponding author for this work

Research output: Contribution to conference › Conference Paper › peer-review


We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions—including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community.
Original language: English
Publication status: Published - 2024
Event: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) - Seattle, United States
Duration: 17 Jun 2024 - 21 Jun 2024


Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Country/Territory: United States


