Ruo-Yu Wang, Jun Du, Shu-Tong Niu, Gao-Bin Yang, Tian Gao, Jia Pan, Qing-Feng Liu
Computer Speech and Language, volume 95, Article 101863. Published 2025-07-28.
DOI: 10.1016/j.csl.2025.101863
https://www.sciencedirect.com/science/article/pii/S0885230825000889
Three-stage modular speaker diarization collaborating with front-end techniques in the CHiME-8 NOTSOFAR-1 challenge
We propose a modular speaker diarization framework that collaborates with front-end techniques in a three-stage process, designed for the challenging CHiME-8 NOTSOFAR-1 acoustic environment. The framework leverages the strengths of deep-learning-based speech separation systems and traditional speech signal processing techniques to provide more accurate initializations for the Neural Speaker Diarization (NSD) system at each stage, thereby enhancing the performance of a single-channel NSD system. First, speaker overlap detection and Continuous Speech Separation (CSS) are applied to the multi-channel speech to obtain clearer single-speaker speech segments for Clustering-based Speaker Diarization (CSD), followed by the first NSD decoding. Next, the binary speaker masks from the first decoding are used to initialize a complex Angular Central Gaussian Mixture Model (cACGMM) that estimates speaker masks on the multi-channel speech. Using Mask-to-VAD post-processing techniques, we obtain per-speaker speech activity with reduced speaker error (SpkErr), followed by a second NSD decoding. Finally, the second decoding results are used to drive Guided Source Separation (GSS), which produces per-speaker speech segments. Short utterances containing one word or fewer are filtered out, and the remaining speech segments are re-clustered for the final NSD decoding. We present evaluation results from our step-by-step exploration in the CHiME-8 NOTSOFAR-1 challenge, demonstrating the effectiveness of our modular diarization system and its contribution to improving speech recognition performance. The code will be open-sourced at https://github.com/rywang99/USTC-NERCSLIP_CHiME-8.
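The abstract does not spell out how Mask-to-VAD or the short-utterance filter work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a per-speaker time-frequency mask is averaged over frequency, thresholded, and stripped of very short bursts, and that a transcript is available for each utterance when counting words. The names `mask_to_vad` and `drop_short_utterances`, and the `threshold` and `min_frames` values, are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def mask_to_vad(mask: List[List[float]],
                threshold: float = 0.5,
                min_frames: int = 3) -> List[Segment]:
    """Turn one speaker's time-frequency mask into activity segments.

    mask[t][f] is the mask value at frame t and frequency bin f
    (each frame is assumed non-empty). threshold and min_frames are
    illustrative values, not taken from the paper.
    """
    # Frame-level activity: mean mask value over frequency, then threshold.
    active = [sum(frame) / len(frame) > threshold for frame in mask]

    segments: List[Segment] = []
    start = None
    # A trailing False acts as a sentinel that closes the last active run.
    for t, is_active in enumerate(active + [False]):
        if is_active and start is None:
            start = t
        elif not is_active and start is not None:
            if t - start >= min_frames:  # drop very short bursts
                segments.append((start, t))
            start = None
    return segments


def drop_short_utterances(
        utterances: List[Tuple[Segment, str]]) -> List[Tuple[Segment, str]]:
    """Keep only utterances whose transcript has more than one word.

    Assumes a transcript string is available per utterance.
    """
    return [(seg, text) for seg, text in utterances if len(text.split()) > 1]


if __name__ == "__main__":
    # Toy mask: 4 active frames, 2 silent, 1 active burst, 2 silent.
    toy_mask = [[0.9, 0.8]] * 4 + [[0.1, 0.2]] * 2 + [[0.9, 0.9]] + [[0.0, 0.0]] * 2
    print(mask_to_vad(toy_mask))
    print(drop_short_utterances([((0, 4), "yeah"), ((6, 9), "sounds good")]))
```

In this toy run the single-frame burst is discarded by the `min_frames` check, which illustrates how such post-processing could suppress spurious activity before the segments are passed to the next decoding stage.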
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.