Abstract
Multimodal clustering/diarization aims to answer the question "who spoke when" by using both audio and visual information. Diarization consists of two steps: first, segmentation of the audio stream and detection of the speech segments; then, clustering of those speech segments to group them by speaker. This task has mainly been studied on audiovisual data from meetings, news broadcasts, or talk shows. In this paper, we use visual information to aid speaker clustering. We tested the proposed method on three full-length movies, a scenario much more difficult than the ones considered so far, since there is no certainty that speech segments and the video appearances of actors will always overlap. The results show that visual information can improve speaker clustering accuracy and hence the diarization process.
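The two-step pipeline described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual method: the energy-threshold segmentation, the 1-D "speaker feature" per segment, and the greedy clustering rule are all stand-in assumptions for what a real system would compute from audio.

```python
# Hypothetical sketch of the two-step diarization pipeline:
# (1) segment the audio and detect speech, (2) cluster speech
# segments by speaker. Thresholds and features are illustrative only.

def detect_speech_segments(energy, threshold=0.5):
    """Return (start, end) frame-index pairs where energy exceeds threshold."""
    segments, start = [], None
    for i, e in enumerate(energy):
        if e > threshold and start is None:
            start = i                      # speech onset
        elif e <= threshold and start is not None:
            segments.append((start, i))    # speech offset
            start = None
    if start is not None:                  # speech runs to the end
        segments.append((start, len(energy)))
    return segments

def cluster_segments(features, max_gap=1.0):
    """Greedy single-pass clustering: a segment joins an existing
    cluster if its feature lies within max_gap of the cluster mean."""
    clusters, labels = [], []
    for f in features:
        for idx, c in enumerate(clusters):
            if abs(f - sum(c) / len(c)) <= max_gap:
                c.append(f)
                labels.append(idx)
                break
        else:
            clusters.append([f])           # start a new speaker cluster
            labels.append(len(clusters) - 1)
    return labels

# Toy frame-level "energy" track with two speech bursts.
energy = [0.1, 0.9, 0.8, 0.1, 0.1, 0.7, 0.9, 0.2]
segments = detect_speech_segments(energy)       # → [(1, 3), (5, 7)]
# One made-up 1-D speaker feature per segment (e.g. a pitch proxy).
labels = cluster_segments([120.0, 121.0])       # same speaker → [0, 0]
```

In a real system the clustering step would operate on richer acoustic embeddings, and the paper's contribution is to combine this audio-side clustering with visual cues from the video.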
Original language | English |
---|---|
Title of host publication | Image and Graphics |
Subtitle of host publication | 8th International Conference, ICIG 2015, Tianjin, China, August 13-16, 2015, Proceedings, Part II |
Editors | Yu-Jin Zhang |
Publisher | Springer |
Pages | 547-554 |
Number of pages | 8 |
ISBN (Electronic) | 9783319219639 |
ISBN (Print) | 9783319219622 |
DOIs | |
Publication status | Published - 4 Aug 2015 |
Event | 8th International Conference, ICIG 2015 - Tianjin, China Duration: 13 Aug 2015 → 16 Aug 2015 |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer |
Volume | 9218 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 8th International Conference, ICIG 2015 |
---|---|
Country/Territory | China |
City | Tianjin |
Period | 13/08/15 → 16/08/15 |
Keywords
- Multimodal
- Diarization
- Clustering
- Movies