DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition

IF 3.1 | Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language | Pub Date: 2025-06-13 | DOI: 10.1016/j.csl.2025.101841

Alexander Polok, Dominik Klement, Martin Kocour, Jiangyu Han, Federico Landini, Bolaji Yusuf, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

Volume 95, Article 101841. Full text: https://www.sciencedirect.com/science/article/pii/S088523082500066X

Citations: 0

Abstract

Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model’s focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers, and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head into Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from the CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model’s target-speaker ASR capabilities while maintaining Whisper’s accuracy and robustness on single-speaker data.
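
The abstract names frame-level diarization-dependent transformations (FDDT) but does not define them. One plausible reading, sketched here as an assumption and not as the authors' implementation, is that each encoder frame passes through several class-specific affine transforms, and the results are mixed using per-frame diarization probabilities. The four frame classes, the function names `affine` and `fddt`, and the identity/zero weights are all hypothetical and chosen only to make the behaviour easy to follow.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical frame classes relative to the target speaker (an assumption,
# not taken from the abstract): silence, target only, non-target only, overlap.
CLASSES = ("silence", "target", "non_target", "overlap")


def affine(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Return weight @ x + bias using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def fddt(
    frames: Sequence[Vector],
    class_probs: Sequence[Sequence[float]],
    weights: Sequence[Matrix],
    biases: Sequence[Vector],
) -> List[Vector]:
    """Mix class-specific affine transforms per frame.

    class_probs[t][c] is the diarization probability that frame t belongs
    to class c; weights[c] and biases[c] are the transform for class c.
    """
    out = []
    for x, probs in zip(frames, class_probs):
        mixed = [0.0] * len(x)
        for p, w, b in zip(probs, weights, biases):
            y = affine(w, b, x)
            mixed = [m + p * yi for m, yi in zip(mixed, y)]
        out.append(mixed)
    return out


def identity(dim: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


def zero_matrix(dim: int) -> Matrix:
    return [[0.0] * dim for _ in range(dim)]


dim = 2
# Illustrative weights only: pass frames that contain the target speaker,
# suppress frames that do not.
weights = [zero_matrix(dim), identity(dim), zero_matrix(dim), identity(dim)]
biases = [[0.0] * dim for _ in CLASSES]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
class_probs = [
    [0.0, 1.0, 0.0, 0.0],  # certainly target
    [0.0, 0.0, 1.0, 0.0],  # certainly another speaker
    [0.0, 0.5, 0.5, 0.0],  # diarization is unsure
]

result = fddt(frames, class_probs, weights, biases)
print(result)  # [[1.0, 2.0], [0.0, 0.0], [2.5, 3.0]]
```

With these toy weights, a frame attributed to the target speaker passes through unchanged, a frame attributed to another speaker is zeroed, and an uncertain frame is attenuated. In a trained model the weights and biases would be learned, and the conditioning would need no speaker embedding, only the diarization output.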


Source journal

Computer Speech and Language (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.