TY - GEN
T1 - Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
T2 - European Conference on Computer Vision
AU - Damen, Dima
AU - Doughty, Hazel
AU - Farinella, Giovanni Maria
AU - Fidler, Sanja
AU - Furnari, Antonino
AU - Kazakos, Evangelos
AU - Moltisanti, Davide
AU - Munro, Jonathan
AU - Perrett, Toby
AU - Price, Will
AU - Wray, Michael
PY - 2018/10/6
Y1 - 2018/10/6
N2 - First-person vision is gaining interest as it offers a unique viewpoint on people’s interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55h of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.
AB - First-person vision is gaining interest as it offers a unique viewpoint on people’s interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55h of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.
KW - Egocentric vision
KW - Dataset
KW - Benchmarks
KW - First-person vision
KW - Egocentric object detection
KW - Action recognition and anticipation
UR - https://arxiv.org/abs/1804.02748
U2 - 10.1007/978-3-030-01225-0_44
DO - 10.1007/978-3-030-01225-0_44
M3 - Conference Contribution (Conference Proceeding)
SN - 9783030012243
T3 - Lecture Notes in Computer Science
SP - 753
EP - 771
BT - Computer Vision – ECCV 2018
PB - Springer, Cham
Y2 - 7 September 2018 through 14 September 2018
ER -