Skill Determination from Long Videos

  • Hazel R Doughty

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)


Skill determination in computer vision is the problem of evaluating how well a person performs a particular task by analysing a recorded video of that performance. Typically, skill determination from video has focussed on short tasks, where the performance of a single action is evaluated in accordance with predefined scoring metrics. This thesis is the first work to explore 'general' skill determination and demonstrates that skill can be automatically determined from video for a variety of different tasks, ranging from surgery to drawing and rolling pizza dough, using the same method. To do this, the problem is formulated as a pairwise ranking of video collections, so that skill is determined relative to other videos of the same task and neither task-specific knowledge nor scoring metrics are required.
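
The pairwise formulation can be illustrated with a small sketch. It is not the thesis's implementation: the margin hinge loss is a common choice for ranking problems and is assumed here, and the function names, video identifiers and scores are all hypothetical. The sketch only shows that supervision takes the form "this video shows more skill than that one", rather than an absolute score per video.

```python
def pairwise_ranking_loss(score_better, score_worse, margin=1.0):
    """Hinge loss: zero once the better video outscores the worse one by `margin`.

    Assumed loss for illustration; the abstract does not specify one.
    """
    return max(0.0, margin - (score_better - score_worse))


def ranking_accuracy(scores, pairs):
    """Fraction of annotated (better, worse) pairs that `scores` orders correctly."""
    correct = sum(1 for better, worse in pairs if scores[better] > scores[worse])
    return correct / len(pairs)


# Hypothetical scalar skill scores predicted for three videos of one task.
scores = {"video_a": 2.5, "video_b": 1.0, "video_c": 1.5}

# Hypothetical annotations: the first video in each pair shows more skill.
pairs = [("video_a", "video_b"), ("video_a", "video_c"), ("video_b", "video_c")]

losses = [pairwise_ranking_loss(scores[b], scores[w]) for b, w in pairs]
accuracy = ranking_accuracy(scores, pairs)
```

Only the third pair is penalised, because the scores place video_c above video_b although the annotation says otherwise. Nothing in the loss refers to what the task is, which is why the same formulation can be applied to tasks as different as surgery and drawing.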

In long videos, parts of the video are often irrelevant for assessing skill, and the skill exhibited may vary throughout a video. It is therefore necessary to determine skill in long videos by attending to the skill-relevant parts. This thesis proposes an approach to train temporal attention modules, learned with only video-level supervision, which separately attend to video parts indicative of higher and lower skill.
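
Temporal attention of this kind can be sketched as a weighted pooling over video segments. The following is a minimal illustration under stated assumptions, not the architecture from the thesis: the softmax weighting, the function names, the feature vectors and the two sets of attention logits (one for a hypothetical "higher skill" branch, one for a "lower skill" branch) are all invented for the example. In a trained model the logits would be produced by learned modules rather than written by hand.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(segment_features, attention_logits):
    """Pool per-segment features into one video-level vector.

    Each segment contributes in proportion to its softmax attention weight.
    """
    weights = softmax(attention_logits)
    dim = len(segment_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, segment_features))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical 2-dimensional features for three segments of one long video.
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical logits: each branch focuses on different segments.
high_skill_logits = [4.0, 0.0, 0.0]  # mostly attends to the first segment
low_skill_logits = [0.0, 0.0, 4.0]   # mostly attends to the last segment

high_pooled, high_weights = attention_pool(segments, high_skill_logits)
low_pooled, low_weights = attention_pool(segments, low_skill_logits)
```

Because the pooled vector is a single video-level representation, it can be scored and trained with video-level labels alone; no segment needs its own annotation.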

Learning to determine skill for each task individually limits the ability to scale to a large number of tasks because of the training and annotation cost. This thesis explores whether features for determining skill are shared across different tasks. It finds that there is potential for sharing information even between seemingly unrelated tasks; however, it is difficult to predict which aspects tasks will share without external knowledge.

This thesis also presents the first method to learn adverbs from instructional videos. It identifies that adverbs in the narrations of instructional videos are often skill-relevant, as they describe how particular actions should be performed. Using weak supervision from these adverbs, the method learns representations, shared across different actions and tasks, that describe the manner in which individual actions have been performed.
Date of Award: 21 Jan 2021
Original language: English
Awarding Institution:
  • University of Bristol
Supervisors: Walterio W Mayol-Cuevas & Dima Damen
