Optimization of modular multi-speaker distant conversational speech recognition
Qinwen Hu, Tianchi Sun, Xin'an Chen, Xiaobin Rong, Jing Lu
Computer Speech and Language, Volume 95, Article 101816 (published 2025-05-14). DOI: 10.1016/j.csl.2025.101816
Abstract
Conducting multi-speaker distant conversational speech recognition on real meeting recordings is a challenging task and has recently become an active area of research. In this work, we focus on modular approaches to addressing this challenge, integrating continuous speech separation (CSS), automatic speech recognition (ASR), and speaker diarization in a pipeline. We explore the effective utilization of the high-performing separation model, TF-GridNet, within our system and propose integration techniques to enhance the performance of the ASR and diarization modules. Our system is evaluated on both the LibriCSS dataset and the real-world CHiME-8 NOTSOFAR-1 dataset. Through a comprehensive analysis of the system's generalization performance, we identify key areas for further improvement in the front-end module.
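To make the modular pipeline structure concrete, the following is a minimal, hypothetical Python sketch of a CSS → ASR → diarization flow of the kind the abstract describes. All class and function names here are illustrative assumptions, not the authors' implementation; a real system would plug trained models (e.g. TF-GridNet for the separation front end) into the placeholder stubs.

```python
import numpy as np

# Hypothetical sketch of a modular distant-ASR pipeline: CSS -> ASR -> diarization.
# Every function below is an illustrative placeholder, not the paper's actual code.

def continuous_speech_separation(mixture: np.ndarray, num_streams: int = 2) -> list[np.ndarray]:
    """Placeholder CSS: split the mixture into overlap-free output streams.
    A real implementation would run a windowed separation model such as
    TF-GridNet and stitch the windowed outputs back together."""
    return [mixture.copy() for _ in range(num_streams)]  # stub: copies of the input

def transcribe(stream: np.ndarray, sample_rate: int) -> list[dict]:
    """Placeholder ASR: return word- or segment-level hypotheses with timestamps."""
    return [{"start": 0.0, "end": len(stream) / sample_rate, "text": "<hyp>"}]

def diarize(stream: np.ndarray, segments: list[dict]) -> list[dict]:
    """Placeholder diarization: attach a speaker label to each ASR segment."""
    return [dict(seg, speaker="spk0") for seg in segments]

def run_pipeline(mixture: np.ndarray, sample_rate: int = 16000) -> list[dict]:
    """Run separation, then ASR and diarization per stream, then merge by time."""
    results = []
    for stream in continuous_speech_separation(mixture):
        segments = transcribe(stream, sample_rate)
        results.extend(diarize(stream, segments))
    return sorted(results, key=lambda seg: seg["start"])

if __name__ == "__main__":
    fake_mixture = np.zeros(16000 * 5)  # 5 s of silence as a stand-in recording
    for seg in run_pipeline(fake_mixture):
        print(f"[{seg['speaker']}] {seg['start']:.2f}-{seg['end']:.2f}: {seg['text']}")
```

The point of the sketch is only the module boundaries: separation produces overlap-free streams, each stream is transcribed independently, and diarization attributes the resulting segments to speakers before they are merged into a single time-ordered transcript.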
Journal introduction:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.