Abstract
This thesis focuses on fine-grained action understanding in videos,
specifically on action recognition and action retrieval, with the
aim of bridging the gap between language and vision. Typically, action segments
have been labelled with a small, chosen set of verbs and/or nouns which
are semantically unambiguous. This approach, called a closed vocabulary,
does not allow interesting relationships between the verbs to be discovered
or utilised, and it is unnatural compared with the vocabulary a human would use.
This thesis explores the issues with expanding the vocabulary of verbs used
for action understanding, including using an unbounded set.
For the action recognition task, videos are commonly given ground truth in
the form of a verb and a noun. Semantic knowledge from external sources
has successfully been used to relate nouns when the vocabulary is expanded beyond
a closed vocabulary, but its use has been largely under-explored for verbs. This
thesis aims to delve into this area in three ways: Firstly, open vocabulary
annotations are collected from multiple annotators and related through the
use of WordNet’s verb hierarchy. Secondly, multi-verb, verb-only annotations
are evaluated for the tasks of action recognition and action retrieval. Finally,
this thesis presents the fine-grained action retrieval task which aims to relate
videos and captions when they are semantically similar.
|Date of Award||23 Jan 2020|
|Supervisor||Dima Damen (Supervisor)|