Multimodal Speaker Diarization Utilizing Face Clustering Information

Ioannis Kapsouras, Anastasios Tefas, Nikos Nikolaidis, Ioannis Pitas

    Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)

    Abstract

    Multimodal clustering/diarization tries to answer the question "who spoke when" by using audio and visual information. Diarization consists of two steps: first, segmentation of the audio information and detection of the speech segments, and then clustering of the speech segments to group the speakers. This task has mainly been studied on audiovisual data from meetings, news broadcasts or talk shows. In this paper, we use visual information to aid speaker clustering. We tested the proposed method on three full-length movies, i.e. a scenario much more difficult than the ones used so far, where there is no certainty that speech segments and video appearances of actors will always overlap. The results showed that visual information can improve the speaker clustering accuracy and hence the diarization process.
    Original language: English
    Title of host publication: Image and Graphics
    Subtitle of host publication: 8th International Conference, ICIG 2015, Tianjin, China, August 13-16, 2015, Proceedings, Part II
    Editors: Yu-Jin Zhang
    Publisher: Springer
    Pages: 547-554
    Number of pages: 8
    ISBN (Electronic): 9783319219639
    ISBN (Print): 9783319219622
    DOIs
    Publication status: Published - 4 Aug 2015
    Event: 8th International Conference, ICIG 2015 - Tianjin, China
    Duration: 13 Aug 2015 to 16 Aug 2015

    Publication series

    Name: Lecture Notes in Computer Science
    Publisher: Springer
    Volume: 9218
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 8th International Conference, ICIG 2015
    Country/Territory: China
    City: Tianjin
    Period: 13/08/15 to 16/08/15

    Keywords

    • Multimodal
    • Diarization
    • Clustering
    • Movies
