End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors

IF 5.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-08-07 DOI:10.1109/TASLP.2024.3439993

Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk

{"title":"End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors","authors":"Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk","doi":"10.1109/TASLP.2024.3439993","DOIUrl":null,"url":null,"abstract":"Despite many recent developments in speaker diarization, it remains a challenge and an active area of research to make diarization robust and effective in real-life scenarios. Well-established clustering-based methods are showing good performance and qualities. However, such systems are built of several independent, separately optimized modules, which may cause non-optimum performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative over the referenced EEND-EDA baseline is the EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe in detail our recent EEND-NAA approach and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end, which can handle recordings containing speech of a variable and unknown number of speakers. Conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In experimental evaluation, the proposed systems achieve up to 51% relative improvement for the simulated scenario and up to 15% for real recordings over the baseline EEND-EDA.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3960-3973"},"PeriodicalIF":5.1000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10629182/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Despite many recent developments in speaker diarization, it remains a challenge and an active area of research to make diarization robust and effective in real-life scenarios. Well-established clustering-based methods are showing good performance and qualities. However, such systems are built of several independent, separately optimized modules, which may cause non-optimum performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative over the referenced EEND-EDA baseline is the EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe in detail our recent EEND-NAA approach and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end, which can handle recordings containing speech of a variable and unknown number of speakers. Conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In experimental evaluation, the proposed systems achieve up to 51% relative improvement for the simulated scenario and up to 15% for real recordings over the baseline EEND-EDA.

查看原文本刊更多论文

使用非自回归吸引子的端到端神经扬声器标示法

尽管最近在说话人日记化方面取得了许多进展，但如何使日记化在现实生活中既稳健又有效，仍然是一个挑战，也是一个活跃的研究领域。基于聚类的成熟方法显示出良好的性能和质量。然而，这类系统由多个独立的、单独优化的模块组成，可能会导致性能不理想。端到端神经扬声器日记化（EEND）系统被认为是追求高性能日记化的下一块基石。然而，这种方法也有其局限性，比如在处理长录音和录音中扬声器数量较多（超过四个）或未知扬声器数量的情况时。基于编码器-解码器吸引子的 EEND（EEND-EDA）的出现，使我们能够利用基于 LSTM 的 EDA 模块，灵活处理包含大量发言人的录音。与 EEND-EDA 基线相比，本文作者最近提出的具有非自回归吸引子的 EEND（EEND-NAA）估计是一种有竞争力的替代方案。NAA 后端将 k-means 聚类作为吸引子估计的一部分，并采用基于变换解码器的吸引子细化模块。不过，在我们之前的 EEND-NAA 工作中，我们假设了已知的扬声器数量，而且实验评估仅限于 2 个扬声器的录音。在本文中，我们将详细介绍我们最近的 EEND-NAA 方法，并提出对 EEND-NAA 架构的进一步改进，引入 NAA 后端的三种新变体，它们可以处理包含不同和未知发言人数的语音录音。实验包括使用 Switchboard 和 NIST SRE 数据集生成的模拟混合物，以及来自 CALLHOME 和 DIHARD II 数据集的真实录音。在实验评估中，与基线 EEND-EDA 相比，建议的系统在模拟场景中实现了高达 51% 的相对改进，在真实录音中实现了高达 15% 的相对改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.