Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation

Research output: Chapter in Book/Report/Conference proceeding; Conference Contribution (Conference Proceeding)


Abstract

This paper presents a deep learning framework for medical video segmentation. Convolutional neural network (CNN) and transformer-based methods have achieved strong results in medical image segmentation thanks to their semantic feature encoding and their ability to capture global context. However, most existing approaches ignore a salient aspect of medical video data: the temporal dimension. Our proposed framework explicitly extracts features from neighbouring frames across the temporal dimension and combines them with a temporal feature blender. The blended high-level spatio-temporal feature is then tokenised and encoded by a Swin Transformer to form a strong global feature. The final segmentation results are produced by a UNet-like encoder-decoder architecture. Our model outperforms other approaches by a significant margin and improves the segmentation benchmarks on the VFSS2022 dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets tested. Our studies also show the efficacy of the temporal feature blending scheme and the cross-dataset transferability of learned capabilities. Code and models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet.
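
For readers unfamiliar with the metric reported above, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `dice_coefficient`, the flattened-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (hypothetical helper).

    Any non-zero entry is treated as foreground.
    """
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels")

    # |A ∩ B|: pixels marked foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # |A| + |B|: foreground pixel counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.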
Original language: English
Title of host publication: 2023 IEEE International Conference on Image Processing (ICIP)
Pages: 2470-2474
Number of pages: 5
ISBN (Electronic): 9781728198354
Publication status: Published - 11 Sept 2023
Event: 2023 IEEE International Conference on Image Processing, Kuala Lumpur Convention Centre, Kuala Lumpur, Malaysia
Duration: 8 Oct 2023 to 11 Oct 2023

Conference

Conference: 2023 IEEE International Conference on Image Processing
Abbreviated title: ICIP 2023
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 8/10/23 to 11/10/23

Keywords

  • cs.CV
  • cs.AI
