基于pixit的演讲者属性ASR的设计选择：ToTaTo团队在NOTSOFAR-1挑战中

IF 3.4 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2025-05-22 DOI:10.1016/j.csl.2025.101824

Joonas Kalda , Séverin Baroudi , Martin Lebourdais , Clément Pagés , Ricard Marxer , Tanel Alumäe , Hervé Bredin

{"title":"基于pixit的演讲者属性ASR的设计选择：ToTaTo团队在NOTSOFAR-1挑战中","authors":"Joonas Kalda , Séverin Baroudi , Martin Lebourdais , Clément Pagés , Ricard Marxer , Tanel Alumäe , Hervé Bredin","doi":"10.1016/j.csl.2025.101824","DOIUrl":null,"url":null,"abstract":"<div><div>PixIT is a recently proposed joint training framework that integrates Permutation Invariant Training (PIT) for speaker diarization and Mixture Invariant Training (MixIT) for speech separation. By leveraging diarization labels, PixIT addresses MixIT’s limitations, producing aligned sources and speaker activations that enable automatic long-form separation. We investigate applications of PixIT on the speaker-attributed automatic speech recognition (SA-ASR) task based on our systems for the NOTSOFAR-1 Challenge. We explore modifications to the joint ToTaToNet by integrating advanced self-supervised learning (SSL) features and masking networks. We show that fine-tuning an ASR system on PixIT-separated sources significantly boosts downstream SA-ASR performance, outperforming standard diarization-based baselines without relying on synthetic data. We explore lightweight post-processing heuristics for improving SA-ASR timestamp errors caused by long silences and artifacts present in file-level separated sources. We also show the potential of extracting speaker embeddings for the diarization pipeline directly from separated sources, with performance rivaling standard methods without any fine-tuning of speaker embeddings. On the NOTSOFAR-1 Challenge dataset, our PixIT-based approach outperforms the CSS-based baseline by 20% in terms of tcpWER after fine-tuning the ASR system on the separated sources. Notably, even when using the same ASR model as the baseline, our system is able to outperform it, without using any of the provided domain-specific synthetic data. These advancements position PixIT as a robust and flexible solution for real-world SA-ASR.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101824"},"PeriodicalIF":3.4000,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Design choices for PixIT-based speaker-attributed ASR: Team ToTaTo at the NOTSOFAR-1 challenge\",\"authors\":\"Joonas Kalda , Séverin Baroudi , Martin Lebourdais , Clément Pagés , Ricard Marxer , Tanel Alumäe , Hervé Bredin\",\"doi\":\"10.1016/j.csl.2025.101824\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>PixIT is a recently proposed joint training framework that integrates Permutation Invariant Training (PIT) for speaker diarization and Mixture Invariant Training (MixIT) for speech separation. By leveraging diarization labels, PixIT addresses MixIT’s limitations, producing aligned sources and speaker activations that enable automatic long-form separation. We investigate applications of PixIT on the speaker-attributed automatic speech recognition (SA-ASR) task based on our systems for the NOTSOFAR-1 Challenge. We explore modifications to the joint ToTaToNet by integrating advanced self-supervised learning (SSL) features and masking networks. We show that fine-tuning an ASR system on PixIT-separated sources significantly boosts downstream SA-ASR performance, outperforming standard diarization-based baselines without relying on synthetic data. We explore lightweight post-processing heuristics for improving SA-ASR timestamp errors caused by long silences and artifacts present in file-level separated sources. We also show the potential of extracting speaker embeddings for the diarization pipeline directly from separated sources, with performance rivaling standard methods without any fine-tuning of speaker embeddings. On the NOTSOFAR-1 Challenge dataset, our PixIT-based approach outperforms the CSS-based baseline by 20% in terms of tcpWER after fine-tuning the ASR system on the separated sources. Notably, even when using the same ASR model as the baseline, our system is able to outperform it, without using any of the provided domain-specific synthetic data. These advancements position PixIT as a robust and flexible solution for real-world SA-ASR.</div></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"95 \",\"pages\":\"Article 101824\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-05-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S088523082500049X\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S088523082500049X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

PixIT是最近提出的一种联合训练框架，它集成了用于说话人特征化的排列不变性训练（PIT）和用于语音分离的混合不变性训练（MixIT）。通过利用diarization标签，PixIT解决了MixIT的局限性，生成对齐的源和扬声器激活，从而实现自动长格式分离。基于我们的NOTSOFAR-1挑战赛系统，我们研究了PixIT在说话人属性自动语音识别（SA-ASR）任务中的应用。我们通过集成高级自监督学习（SSL）特征和屏蔽网络来探索对联合ToTaToNet的修改。我们发现，在pixit分离源上对ASR系统进行微调可以显著提高下游SA-ASR性能，在不依赖合成数据的情况下优于标准的基于数字化的基线。我们探索轻量级的后处理启发式方法，以改进由文件级分离源中存在的长沉默和工件引起的SA-ASR时间戳错误。我们还展示了直接从分离的源中提取扬声器嵌入的潜力，其性能可与标准方法相媲美，而无需对扬声器嵌入进行任何微调。在NOTSOFAR-1 Challenge数据集上，在对分离源的ASR系统进行微调后，我们基于pixit的方法在tcpWER方面比基于css的基线高出20%。值得注意的是，即使在使用相同的ASR模型作为基线时，我们的系统也能够在不使用任何提供的特定于领域的合成数据的情况下优于它。这些进步使PixIT成为现实世界SA-ASR的强大而灵活的解决方案。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Design choices for PixIT-based speaker-attributed ASR: Team ToTaTo at the NOTSOFAR-1 challenge

PixIT is a recently proposed joint training framework that integrates Permutation Invariant Training (PIT) for speaker diarization and Mixture Invariant Training (MixIT) for speech separation. By leveraging diarization labels, PixIT addresses MixIT’s limitations, producing aligned sources and speaker activations that enable automatic long-form separation. We investigate applications of PixIT on the speaker-attributed automatic speech recognition (SA-ASR) task based on our systems for the NOTSOFAR-1 Challenge. We explore modifications to the joint ToTaToNet by integrating advanced self-supervised learning (SSL) features and masking networks. We show that fine-tuning an ASR system on PixIT-separated sources significantly boosts downstream SA-ASR performance, outperforming standard diarization-based baselines without relying on synthetic data. We explore lightweight post-processing heuristics for improving SA-ASR timestamp errors caused by long silences and artifacts present in file-level separated sources. We also show the potential of extracting speaker embeddings for the diarization pipeline directly from separated sources, with performance rivaling standard methods without any fine-tuning of speaker embeddings. On the NOTSOFAR-1 Challenge dataset, our PixIT-based approach outperforms the CSS-based baseline by 20% in terms of tcpWER after fine-tuning the ASR system on the separated sources. Notably, even when using the same ASR model as the baseline, our system is able to outperform it, without using any of the provided domain-specific synthetic data. These advancements position PixIT as a robust and flexible solution for real-world SA-ASR.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer Speech and Language 工程技术-计算机：人工智能

CiteScore

11.30

自引率

4.70%

发文量

审稿时长

22.9 weeks

期刊介绍： Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.