Visual voice activity detection based on spatiotemporal information and bag of words

Foteini Patrona, Alexandros Iosifidis, Anastasios Tefas, Nikos Nikolaidis, Ioannis Pitas

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)


Abstract

A novel method for Visual Voice Activity Detection (V-VAD) is proposed that exploits local shape and motion information at spatiotemporal locations of interest for facial region video description, and the Bag of Words (BoW) model for facial region video representation. Facial region video classification is subsequently performed by a Single-hidden Layer Feedforward Neural (SLFN) network trained with the recently proposed kernel Extreme Learning Machine (kELM) algorithm on training facial videos depicting talking and non-talking persons. Experimental results on two publicly available V-VAD data sets demonstrate the effectiveness of the proposed method, which achieves better generalization to unseen users than recently proposed state-of-the-art methods.
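The pipeline outlined in the abstract (local spatiotemporal descriptors → BoW codebook and histogram → kernel-based SLFN classification) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the synthetic Gaussian "descriptors", the codebook size, the RBF kernel, and the kernel-ridge form of kELM training are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=25):
    """Plain k-means for codebook learning over pooled local descriptors."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest codeword; L1-normalize."""
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def rbf_kernel(A, B, gamma=10.0):
    """RBF kernel between two sets of BoW histograms."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kelm_train(K, targets, C=10.0):
    """Kernel ELM output weights: alpha = (K + I/C)^{-1} T."""
    return np.linalg.solve(K + np.eye(len(K)) / C, targets)

# ---- synthetic demo (stand-in for real facial-region video descriptors) ----
dim, n_per_class, n_desc = 6, 20, 30
talking = [rng.normal(1.5, 1.0, (n_desc, dim)) for _ in range(n_per_class)]
silent = [rng.normal(0.0, 1.0, (n_desc, dim)) for _ in range(n_per_class)]
videos = talking + silent
labels = np.array([1.0] * n_per_class + [-1.0] * n_per_class)

codebook = kmeans(np.vstack(videos), k=8)
H = np.vstack([bow_histogram(v, codebook) for v in videos])

train = np.arange(0, len(videos), 2)  # even indices for training
test = np.arange(1, len(videos), 2)   # odd indices for testing

alpha = kelm_train(rbf_kernel(H[train], H[train]), labels[train])
pred = np.sign(rbf_kernel(H[test], H[train]) @ alpha)
acc = (pred == labels[test]).mean()
```

With ±1 targets, the sign of the kernel-regression output gives the talking / non-talking decision; the regularized solve corresponds to the closed-form kELM training step.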
Original language: English
Title of host publication: 2015 IEEE International Conference on Image Processing (ICIP 2015)
Subtitle of host publication: Proceedings of a meeting held 27-30 September 2015, Quebec City, Quebec, Canada
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 2334-2338
Number of pages: 5
ISBN (Electronic): 9781479983391
ISBN (Print): 9781479983407
DOIs
Publication status: Published - Jan 2016
Event: 2015 IEEE International Conference on Image Processing (ICIP) - Quebec City, QC, Canada
Duration: 27 Sep 2015 - 30 Sep 2015

Conference

Conference: 2015 IEEE International Conference on Image Processing (ICIP)
Country: Canada
City: Quebec City, QC
Period: 27/09/15 - 30/09/15

Keywords

  • Voice Activity Detection
  • Space-Time Interest Points
  • Bag of Words model
  • kernel Extreme Learning Machine

