End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors

IF 4.1 2区 计算机科学 Q1 ACOUSTICS
Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk
{"title":"End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors","authors":"Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk","doi":"10.1109/TASLP.2024.3439993","DOIUrl":null,"url":null,"abstract":"Despite many recent developments in speaker diarization, it remains a challenge and an active area of research to make diarization robust and effective in real-life scenarios. Well-established clustering-based methods are showing good performance and qualities. However, such systems are built of several independent, separately optimized modules, which may cause non-optimum performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative over the referenced EEND-EDA baseline is the EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe in detail our recent EEND-NAA approach and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end, which can handle recordings containing speech of a variable and unknown number of speakers. Conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In experimental evaluation, the proposed systems achieve up to 51% relative improvement for the simulated scenario and up to 15% for real recordings over the baseline EEND-EDA.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3960-3973"},"PeriodicalIF":4.1000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10629182/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Despite many recent developments in speaker diarization, it remains a challenge and an active area of research to make diarization robust and effective in real-life scenarios. Well-established clustering-based methods are showing good performance and qualities. However, such systems are built of several independent, separately optimized modules, which may cause non-optimum performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative over the referenced EEND-EDA baseline is the EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe in detail our recent EEND-NAA approach and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end, which can handle recordings containing speech of a variable and unknown number of speakers. Conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In experimental evaluation, the proposed systems achieve up to 51% relative improvement for the simulated scenario and up to 15% for real recordings over the baseline EEND-EDA.
使用非自回归吸引子的端到端神经扬声器标示法
尽管最近在说话人日记化方面取得了许多进展,但如何使日记化在现实生活中既稳健又有效,仍然是一个挑战,也是一个活跃的研究领域。基于聚类的成熟方法显示出良好的性能和质量。然而,这类系统由多个独立的、单独优化的模块组成,可能会导致性能不理想。端到端神经扬声器日记化(EEND)系统被认为是追求高性能日记化的下一块基石。然而,这种方法也有其局限性,比如在处理长录音和录音中扬声器数量较多(超过四个)或未知扬声器数量的情况时。基于编码器-解码器吸引子的 EEND(EEND-EDA)的出现,使我们能够利用基于 LSTM 的 EDA 模块,灵活处理包含大量发言人的录音。与 EEND-EDA 基线相比,本文作者最近提出的具有非自回归吸引子的 EEND(EEND-NAA)估计是一种有竞争力的替代方案。NAA 后端将 k-means 聚类作为吸引子估计的一部分,并采用基于变换解码器的吸引子细化模块。不过,在我们之前的 EEND-NAA 工作中,我们假设了已知的扬声器数量,而且实验评估仅限于 2 个扬声器的录音。在本文中,我们将详细介绍我们最近的 EEND-NAA 方法,并提出对 EEND-NAA 架构的进一步改进,引入 NAA 后端的三种新变体,它们可以处理包含不同和未知发言人数的语音录音。实验包括使用 Switchboard 和 NIST SRE 数据集生成的模拟混合物,以及来自 CALLHOME 和 DIHARD II 数据集的真实录音。在实验评估中,与基线 EEND-EDA 相比,建议的系统在模拟场景中实现了高达 51% 的相对改进,在真实录音中实现了高达 15% 的相对改进。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
IEEE/ACM Transactions on Audio, Speech, and Language Processing
IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
11.30
自引率
11.10%
发文量
217
期刊介绍: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信