Abstract
Hands are some of the most complex and enabling instruments that humans use. Their dexterity, afforded by their multiple degrees of freedom, has allowed people to create, communicate, and, overall, transform thoughts into actions that change the world.
The advent of artificial sensing from cameras promises that the rich data hands offer can be taken further once it is understood and processed. Being able to decode what hands do opens up multiple possibilities for human-computer interaction, especially when hands can be observed without intrusion. From computers and robotic devices learning how to perform actions, to translating explicit and implicit communication, to linking with Extended Reality (VR and AR) applications, decoding hands is a key pursuit for understanding people and how we operate in the world.
This thesis starts with hand feature extraction as a way to decode what hands are doing.
To begin, we examine the hand’s most critical feature: the hand pose, and we implement two
approaches for extracting it: model-based and data-based. In the model-based approach, a kinematics-based hand skeleton is built and fitted to the observed depth data to estimate the hand pose. In the data-based approach, we implement a convolutional neural network model to predict the positions of the hand joints. To address the issue of data scarcity, we created a synthetic dataset and a multi-camera data collection setup. The synthetic data come with exact annotations, and both the amount of data and the way it is collected can be programmed.
The multi-camera setup is built to compensate for the lack of realism in the synthetic data and to allow data to be collected when the hand is obscured by an object.
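As a rough illustration of the data-based approach mentioned above, the sketch below shows how a small convolutional network could regress 3D hand joint positions from a depth crop. The layer sizes, joint count, and input resolution are illustrative assumptions and are not taken from the thesis.

```python
# Minimal sketch: a small convolutional network regressing 3D positions of
# hand joints from a single-channel depth crop. All layer sizes, the joint
# count, and the input resolution are assumptions for illustration only.
import torch
import torch.nn as nn

NUM_JOINTS = 21  # assumed joint count (e.g. wrist + 4 joints per finger)

class DepthToJoints(nn.Module):
    def __init__(self, num_joints: int = NUM_JOINTS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, num_joints * 3)  # (x, y, z) per joint

    def forward(self, depth: torch.Tensor) -> torch.Tensor:
        # depth: (batch, 1, H, W) normalised depth crop around the hand
        x = self.features(depth).flatten(1)
        return self.head(x).view(-1, NUM_JOINTS, 3)

# Example: one 128x128 depth crop -> 21 predicted 3D joint positions.
joints = DepthToJoints()(torch.randn(1, 1, 128, 128))
print(joints.shape)  # torch.Size([1, 21, 3])
```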
To further investigate the relationship between the hand and the objects in the environment,
we attempted to extract additional features such as the hand mask and in-hand object mask
during the hand-object interaction. The features obtained from these in-hand object interactions were then combined to determine the status of the hand-object interaction. The thesis then looks into
how the extracted hand features can be used in human-computer interactions. We examined
two possible scenarios. The first scenario is to capture how humans interact with a machine
they need to operate, so that the system can subsequently teach a novice without the use of
paper instructions. We invited participants to help us evaluate our system. The results, based on operating a sewing machine, show that our system is more efficient than the paper instructions for participants without any prior expertise. The second application scenario we demonstrate is how
to segment a hand manipulation video automatically by extracting the features of both hands
in a complex scene with various objects. We invited participants to conduct experiments on our
own collected and labelled Chinese Tea Making dataset. The results showed that the automatic segmentation was very close to the experts' manual annotations, and that participants were able to identify the steps in the task by watching the segmented video clips.
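As a rough illustration of how per-frame hand-object features could drive automatic step segmentation, the sketch below groups consecutive frames that share the same pair of in-hand objects into timed segments. The object labels, frame rate, and grouping rule are hypothetical and are not the method used in the thesis.

```python
# Minimal sketch: place a step boundary wherever the pair of in-hand objects
# (as might be derived from hand and in-hand object masks) changes between
# frames. Labels and frame rate are illustrative assumptions.
from itertools import groupby

def segment_steps(per_frame_objects, fps=30.0):
    """Group consecutive frames with identical (left, right) in-hand objects
    into (start_time, end_time, objects) segments."""
    segments, frame = [], 0
    for objects, run in groupby(per_frame_objects):
        length = sum(1 for _ in run)
        segments.append((frame / fps, (frame + length) / fps, objects))
        frame += length
    return segments

# Toy per-frame annotations: (left-hand object, right-hand object).
frames = [("none", "kettle")] * 90 + [("lid", "kettle")] * 60 + [("none", "cup")] * 120
for start, end, objs in segment_steps(frames):
    print(f"{start:5.1f}s - {end:5.1f}s  {objs}")
```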
Overall, we hope that the research presented in this thesis will inspire future work on hands
and broaden the scope of uses for hands as a cue in intelligent systems. The code related to our work can be found at this link.
| Date of Award | 22 Mar 2022 |
| --- | --- |
| Original language | English |
| Awarding Institution | |
| Supervisor | Walterio W Mayol-Cuevas (Supervisor) |