Title: Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction
Authors: Timothée Dhaussy, B. Jabaian, F. Lefèvre, R. Horaud
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publication date: 2023-06-04
DOI: 10.1109/ICASSP49357.2023.10096295
Citations: 0
Abstract
The speaker diarization task answers the question "who is speaking at a given time?" and provides valuable information for scene analysis in domains such as robotics. In this paper, we introduce a temporal audio-visual fusion model for multi-user speaker diarization with low computational requirements, good robustness, and no training phase. The proposed method identifies the dominant speakers and tracks them over time by measuring the spatial coincidence between sound locations and visual presence. The model is generative and its parameters are estimated online, so it does not require training. Its effectiveness was assessed on two datasets: a public one and one collected in-house with the Pepper humanoid robot.
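The core idea of spatial-coincidence fusion can be illustrated with a minimal sketch: score each visually detected person by how closely their direction matches the estimated sound-source direction. This is an illustrative toy (a Gaussian angular-likelihood with a hypothetical `sigma` width parameter), not the paper's actual generative model or online estimation procedure.

```python
import math

def coincidence_scores(sound_azimuth, face_azimuths, sigma=10.0):
    """Normalized spatial-coincidence scores between one sound-source
    direction and each visually detected face (angles in degrees).
    Illustrative sketch only -- not the authors' model."""
    scores = []
    for az in face_azimuths:
        d = abs(sound_azimuth - az)
        d = min(d, 360.0 - d)  # wrap-around angular distance
        scores.append(math.exp(-0.5 * (d / sigma) ** 2))
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores

def dominant_speaker(sound_azimuth, face_azimuths):
    """Index of the face whose direction best coincides with the sound."""
    scores = coincidence_scores(sound_azimuth, face_azimuths)
    return max(range(len(scores)), key=lambda i: scores[i])
```

For example, with faces at -30°, 10°, and 45° and a sound localized at 12°, the second face would be selected as the dominant speaker; a full diarization system would additionally smooth such per-frame decisions over time.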