Abstract
With the ever-increasing demand for higher perceptual quality video content, enhancement techniques such as video frame interpolation (VFI), which aims to generate non-existent intermediate frames between consecutive original frames in a video, have attracted
increasing research attention. While existing deep learning-based VFI research has seen remarkable progress in interpolation quality, we identify three major limitations: the lack of a
model architecture that jointly addresses various challenges in VFI (model architecture), the
under-investigation of alignment between subjective and objective quality measures (perceptual
quality assessment), and the absence of a study on generative modelling for VFI (modelling
approach).
In this context, this thesis aims to propose novel VFI methods and databases to narrow these
three research gaps. Specifically, we first perform a video texture-based analysis of state-of-the-art VFI methods and reveal their inconsistent performance under different challenging scenarios.
Then, under the current distortion-based paradigm, we propose two novel VFI architectures to
jointly handle occlusion, large motion, and dynamic textures. These methods combine existing
flow-based and kernel-based VFI approaches with novel components, and our comprehensive
evaluation results clearly demonstrate that they consistently outperform existing methods on
varied and representative test datasets, with significant gains of up to 1.09 dB in PSNR for cases
including large motion and dynamic textures.
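For reference, the PSNR figures quoted above follow the standard definition for 8-bit content (a textbook formula, not specific to this thesis):

$$
\mathrm{PSNR}(x, \hat{x}) = 10 \log_{10} \frac{255^2}{\tfrac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2},
$$

where $x$ is the ground-truth intermediate frame, $\hat{x}$ the interpolated frame, and $N$ the number of pixel values; a 1.09 dB gain thus corresponds to a roughly 22% reduction in mean squared error.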
Secondly, to advance our understanding of how humans perceive the quality of frame-interpolated content, and how well existing objective quality assessment methods perform when measuring this perceived quality, we develop a new video quality database named BVI-VFI. We
conducted a large-scale subjective study to collect human opinion scores on VFI quality, with which we
further analysed the influence of VFI algorithms and different video features on the perceptual
quality of interpolated videos. Moreover, we benchmarked the performance of 28 classic and
state-of-the-art objective image/video quality metrics on the new database, and demonstrated the
urgent need for more accurate, bespoke quality assessment methods for VFI. Based on this
observation, we propose FloLPIPS, the first bespoke video quality assessment method for VFI.
FloLPIPS has been quantitatively benchmarked, and the results demonstrate its state-of-the-art
performance, with a 9% improvement in correlation with subjective quality scores over the best-performing baseline.
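To make the idea behind a flow-aware LPIPS metric concrete, the sketch below illustrates one plausible construction: pooling a spatial LPIPS distance map with weights derived from optical-flow distortion, so that temporally disrupted regions dominate the score. The use of `raft_small`, the `lpips` package, and the exact weighting scheme are illustrative assumptions, not the thesis's implementation.

```python
# A minimal sketch of the flow-weighted LPIPS idea (not the thesis's exact
# implementation): pool a spatial LPIPS distance map with weights derived
# from optical-flow distortion, so temporally distorted regions count more.
# Assumes: `pip install lpips` and torchvision>=0.13 (for pretrained RAFT).
import torch
import lpips
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

lpips_net = lpips.LPIPS(net='alex', spatial=True).eval()  # per-pixel LPIPS map
raft = raft_small(weights=Raft_Small_Weights.DEFAULT).eval()

@torch.no_grad()
def flow_weighted_lpips(prev_frame, ref_frame, dis_frame, eps=1e-8):
    """All inputs: (1, 3, H, W) tensors in [-1, 1], H and W divisible by 8.

    prev_frame: the original frame preceding the interpolated position;
    ref_frame:  the ground-truth intermediate frame;
    dis_frame:  the frame produced by a VFI method.
    """
    flow_ref = raft(prev_frame, ref_frame)[-1]  # (1, 2, H, W) final flow estimate
    flow_dis = raft(prev_frame, dis_frame)[-1]
    # Per-pixel flow distortion magnitude, normalised into a weight map.
    w = (flow_ref - flow_dis).norm(dim=1, keepdim=True)
    w = w / (w.sum() + eps)
    # Flow-weighted spatial pooling of the LPIPS distance map.
    d = lpips_net(ref_frame, dis_frame)         # (1, 1, H, W)
    return (w * d).sum()
```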
Thirdly, towards developing perceptually oriented VFI methods, we propose LDMVFI, a latent diffusion model-based VFI method. It approaches the VFI problem from a generative perspective by
formulating it as a conditional generation problem. As the first effort to address VFI using latent
diffusion models, we rigorously benchmark our method on common test sets used in the existing
VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to
interpolate video content with favourable perceptual quality compared to the state of the art,
even in the high-resolution regime, with a notable 20% performance gain on a challenging 4K
test set.
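To illustrate the conditional-generation formulation, the toy sketch below samples a middle frame from a DDPM-style reverse process conditioned on the two neighbouring frames. The placeholder denoiser, the 50-step linear noise schedule, and the pixel-space operation are all illustrative assumptions; LDMVFI itself denoises in the latent space of a learned autoencoder.

```python
# A toy sketch of VFI posed as conditional generation: sample the middle
# frame from p(x_mid | frame0, frame1) with a DDPM-style reverse loop.
import torch
import torch.nn as nn

T = 50                                  # diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)   # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Placeholder for a trained noise-prediction network; it sees the noisy
# sample concatenated with the two conditioning (neighbouring) frames.
denoiser = nn.Conv2d(9, 3, kernel_size=3, padding=1)

@torch.no_grad()
def sample_middle_frame(frame0, frame1):
    """frame0, frame1: (1, 3, H, W) neighbouring frames."""
    x = torch.randn_like(frame0)        # start the reverse process from noise
    for t in reversed(range(T)):
        eps = denoiser(torch.cat([x, frame0, frame1], dim=1))
        # DDPM posterior mean given the predicted noise.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                       # add stochasticity except at the last step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

mid = sample_middle_frame(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```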
| Date of Award | 10 Dec 2024 |
| --- | --- |
| Original language | English |
| Awarding Institution | |
| Supervisor | David R Bull (Supervisor), Fan Zhang (Supervisor) & Andrew Calway (Supervisor) |