Three-stage modular speaker diarization collaborating with front-end techniques in the CHiME-8 NOTSOFAR-1 challenge

IF 3.4 | Zone 3 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2025-07-28 DOI:10.1016/j.csl.2025.101863

Ruo-Yu Wang , Jun Du , Shu-Tong Niu , Gao-Bin Yang , Tian Gao , Jia Pan , Qing-Feng Liu

Computer Speech and Language, Volume 95, Article 101863. Full text: https://www.sciencedirect.com/science/article/pii/S0885230825000889

Citations: 0

Abstract


Three-stage modular speaker diarization collaborating with front-end techniques in the CHiME-8 NOTSOFAR-1 challenge

We propose a modular speaker diarization framework that collaborates with front-end techniques in a three-stage process, designed for the challenging CHiME-8 NOTSOFAR-1 acoustic environment. The framework combines the strengths of deep-learning-based speech separation systems and traditional speech signal processing techniques to provide more accurate initializations for the Neural Speaker Diarization (NSD) system at each stage, thereby improving the performance of a single-channel NSD system. First, speaker overlap detection and Continuous Speech Separation (CSS) are applied to the multi-channel speech to obtain cleaner single-speaker speech segments for Clustering-based Speaker Diarization (CSD), followed by the first NSD decoding. Next, the binary speaker masks from the first decoding are used to initialize a complex Angular Central Gaussian Mixture Model (cACGMM), which estimates speaker masks on the multi-channel speech. Using Mask-to-VAD post-processing, we obtain per-speaker speech activity with reduced speaker error (SpkErr), followed by a second NSD decoding. Finally, the results of the second decoding drive Guided Source Separation (GSS) to produce per-speaker speech segments. Short utterances containing one word or fewer are filtered out, and the remaining speech segments are re-clustered for the final NSD decoding. We report evaluation results from each stage of our work on the CHiME-8 NOTSOFAR-1 challenge, demonstrating the effectiveness of our modular diarization system and its contribution to improving speech recognition performance. The code will be open-sourced at https://github.com/rywang99/USTC-NERCSLIP_CHiME-8.
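The abstract does not say how the Mask-to-VAD post-processing works, so the following is only an illustrative sketch of one plausible reading, not the authors' method. It assumes each speaker mask is a time-frequency grid of values in [0, 1], that a frame counts as active when its mean mask value exceeds a threshold, and that very short active runs are discarded. The function names, the threshold of 0.5, and the minimum run length of 3 frames are all hypothetical.

```python
from typing import List, Tuple


def active_runs(active: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of consecutive active runs, end exclusive."""
    runs = []
    start = None
    # A trailing False closes any run that reaches the last frame.
    for t, is_on in enumerate(active + [False]):
        if is_on and start is None:
            start = t
        elif not is_on and start is not None:
            runs.append((start, t))
            start = None
    return runs


def mask_to_vad(
    mask: List[List[List[float]]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
    min_frames: int = 3,     # assumed value, not taken from the paper
) -> List[List[bool]]:
    """Convert masks indexed [speaker][frame][frequency] to frame-level activity."""
    vad = []
    for speaker_mask in mask:
        # Average the mask over frequency and threshold it, frame by frame.
        active = [sum(frame) / len(frame) > threshold for frame in speaker_mask]
        # Discard active runs that are shorter than min_frames.
        for start, end in active_runs(active):
            if end - start < min_frames:
                active[start:end] = [False] * (end - start)
        vad.append(active)
    return vad


if __name__ == "__main__":
    high, low = [0.9] * 4, [0.1] * 4
    # Speaker 0 talks in frames 2-6 and has a one-frame blip at frame 8.
    speaker0 = [low, low, high, high, high, high, high, low, high, low]
    speaker1 = [low] * 10
    vad = mask_to_vad([speaker0, speaker1])
    print(active_runs(vad[0]))
    print(active_runs(vad[1]))
```

In this toy input the isolated frame is removed because it is shorter than the assumed minimum run length. That mirrors, in spirit only, the idea of cleaning per-speaker activity before it is passed to the next decoding stage.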


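Source journal

Computer Speech and Language (Engineering and Technology - Computer Science: Artificial Intelligence)

CiteScore: 11.30
Self-citation rate: 4.70%
Articles published:
Review time: 22.9 weeks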

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.