Abstract
We investigate video transforms that result in class-homogeneous label-transforms. These are video transforms that consistently maintain or modify the labels of all videos in each class. We propose a general approach to discover invariant classes, whose transformed examples maintain their label; pairs of equivariant classes, whose transformed examples exchange their labels; and novel-generating classes, whose transformed examples belong to a new class outside the dataset. Label transforms offer additional supervision previously unexplored in video recognition, benefiting data augmentation and enabling zero-shot learning opportunities by learning a class from transformed videos of its counterpart. Amongst such video transforms, we study horizontal-flipping, time-reversal, and their composition. We highlight errors in naively using horizontal-flipping as a form of data augmentation in video. Next, we validate the realism of time-reversed videos through a human perception study in which people exhibit equal preference for forward and time-reversed videos. Finally, we test our approach on two datasets, Jester and Something-Something, evaluating the three video transforms for zero-shot learning and data augmentation. Our results show that gestures such as ‘zooming in’ can be learnt from ‘zooming out’ in a zero-shot setting, as well as more complex actions with state transitions such as ‘digging something out of something’ from ‘burying something in something’.
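For concreteness, the sketch below (not from the paper) illustrates the three studied pixel-space transforms and a hypothetical label mapping for one equivariant class pair, assuming clips are stored as numpy arrays of shape (T, H, W, C).

```python
import numpy as np

# Assumed layout: a clip is a numpy array of shape (T, H, W, C)
# -- frames, height, width, channels.

def horizontal_flip(video: np.ndarray) -> np.ndarray:
    """Mirror every frame left-to-right (flip the width axis)."""
    return video[:, :, ::-1, :]

def time_reversal(video: np.ndarray) -> np.ndarray:
    """Play the clip backwards (flip the time axis)."""
    return video[::-1]

def flip_and_reverse(video: np.ndarray) -> np.ndarray:
    """Composition of horizontal flipping and time reversal."""
    return time_reversal(horizontal_flip(video))

# Hypothetical label transform under time reversal: classes in an
# equivariant pair exchange labels, invariant classes keep theirs.
EQUIVARIANT_PAIRS = {"zooming in": "zooming out",
                     "zooming out": "zooming in"}

def reversed_clip_label(label: str) -> str:
    """Label assigned to a time-reversed clip."""
    return EQUIVARIANT_PAIRS.get(label, label)
```

The mapping shows how a zero-shot class such as ‘zooming in’ could be supervised using time-reversed clips of ‘zooming out’; the class names and helper functions here are illustrative only.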
| Original language | English |
| --- | --- |
| Title of host publication | IEEE/CVF International Conference on Computer Vision (ICCV) 2019 |
| Number of pages | 10 |
| Publication status | Published - 2 Nov 2019 |
| Event | IEEE/CVF International Conference on Computer Vision (ICCV) 2019, Seoul, Korea. Duration: 27 Oct 2019 → 2 Nov 2019 |
Conference
| Conference | IEEE/CVF International Conference on Computer Vision (ICCV) 2019 |
| --- | --- |
| City | Seoul |
| Period | 27/10/19 → 2/11/19 |