End-to-End Audio-Visual Neural Speaker Diarization

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10106

Maokui He, Jun Du, Chin-Hui Lee


Citations: 9

Abstract

In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our model takes audio features (e.g., FBANKs), multi-speaker lip regions of interest (ROIs), and multi-speaker i-vector embeddings as multimodal inputs. A set of binary classification output layers produces the speech activity of each speaker. With this carefully designed end-to-end structure, the proposed method can explicitly handle overlapping speech and accurately distinguish speech from non-speech using multimodal information. I-vectors are the key to solving the alignment problem caused by visual modality errors (e.g., occlusions, off-screen speakers, or unreliable detection). Moreover, our audio-visual model is robust to the absence of the visual modality, a condition under which the diarization performance of the visual-only model degrades significantly. Evaluated on the datasets of the first Multimodal Information Based Speech Processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on the development/evaluation sets with reference voice activity detection (VAD) information, while the audio-only and video-only systems yielded DERs of 27.9%/29.0% and 14.6%/13.1%, respectively.
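To illustrate how one binary output per speaker allows overlapping speech, and how a DER figure is formed, here is a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 threshold, and the toy probabilities are assumptions made for illustration. It also assumes hypothesis and reference speaker columns are already aligned, and it scores at frame level without the forgiveness collar or optimal speaker mapping that standard scoring tools may apply.

```python
from typing import List, Sequence


def activities_from_probs(
    probs: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[List[int]]:
    """Threshold each speaker's speech probability independently.

    Because every speaker has a separate binary decision, several speakers
    can be active in the same frame (overlapping speech), or none (non-speech).
    """
    return [[1 if p >= threshold else 0 for p in frame] for frame in probs]


def frame_level_der(
    ref: Sequence[Sequence[int]], hyp: Sequence[Sequence[int]]
) -> float:
    """Simplified frame-level DER = (miss + false alarm + confusion) / reference speech."""
    miss = false_alarm = confusion = total_ref = 0
    for ref_frame, hyp_frame in zip(ref, hyp):
        n_ref = sum(ref_frame)
        n_hyp = sum(hyp_frame)
        # Speakers marked active in both reference and hypothesis.
        n_correct = sum(1 for r, h in zip(ref_frame, hyp_frame) if r and h)
        miss += max(0, n_ref - n_hyp)
        false_alarm += max(0, n_hyp - n_ref)
        confusion += min(n_ref, n_hyp) - n_correct
        total_ref += n_ref
    if total_ref == 0:
        raise ValueError("reference contains no speech frames")
    return (miss + false_alarm + confusion) / total_ref


# Toy example: 4 frames, 2 speakers (hypothetical values).
probs = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.6], [0.1, 0.2]]
hyp = activities_from_probs(probs)
ref = [[1, 0], [1, 1], [1, 0], [0, 0]]
der = frame_level_der(ref, hyp)
print(hyp, der)
```

In this toy case the second frame has both speakers active, and the third frame attributes speech to the wrong speaker, which counts as one confusion error out of four reference speaker-frames.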
