Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu
{"title":"MSDWild:狂野中的多模态说话人日记数据集","authors":"Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu","doi":"10.21437/interspeech.2022-10466","DOIUrl":null,"url":null,"abstract":"Speaker diarization in real-world acoustic environments is a challenging task of increasing interest from both academia and industry. Although it has been widely accepted that incorporat-ing visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this pa-per, we release MSDWild ∗ , a benchmark dataset for multimodal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without over-editing such as lens switching. Audio and video are both released. In particular, MSDWild has a large portion of the naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we also conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1476-1480"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"MSDWild: Multi-modal Speaker Diarization Dataset in the Wild\",\"authors\":\"Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu\",\"doi\":\"10.21437/interspeech.2022-10466\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speaker diarization in real-world acoustic environments is a challenging task of increasing interest from both academia and industry. Although it has been widely accepted that incorporat-ing visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this pa-per, we release MSDWild ∗ , a benchmark dataset for multimodal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without over-editing such as lens switching. Audio and video are both released. In particular, MSDWild has a large portion of the naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we also conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"1476-1480\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-10466\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-10466","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild
Speaker diarization in real-world acoustic environments is a challenging task of increasing interest from both academia and industry. Although it has been widely accepted that incorporat-ing visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this pa-per, we release MSDWild ∗ , a benchmark dataset for multimodal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without over-editing such as lens switching. Audio and video are both released. In particular, MSDWild has a large portion of the naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we also conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.