Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition

IF 3.2 · Zone 2 (Engineering Technology) · Q2 ENGINEERING, ELECTRICAL & ELECTRONIC
Lin Zheng;Han Zhu;Sanli Tian;Qingwei Zhao;Ta Li
DOI: 10.1109/LSP.2024.3487795
Journal: IEEE Signal Processing Letters, vol. 31, pp. 3119-3123
Published: 2024-10-29
Publication type: Journal Article
URL: https://ieeexplore.ieee.org/document/10737652/
Citations: 0

Abstract

Serialized Output Training (SOT) has emerged as the mainstream approach to multi-talker overlapped speech recognition because of its simplicity. However, SOT suffers from cross-domain performance degradation, which hinders its application. Meanwhile, traditional domain adaptation methods may harm the accuracy of speaker change point prediction, as evaluated by UD-CER, an important metric in SOT. To solve these issues, we propose Pseudo-Labeling based SOT (PL-SOT) for domain adaptation, which treats the speaker change token (<sc>) specially during training to increase the accuracy of speaker change point prediction. First, we improve the CTC loss by proposing the Weakening and Enhancing CTC (WE-CTC) loss, which weakens the learning of error-prone labels surrounding <sc> while enhancing the emission probability of <sc> by modifying the posteriors of the pseudo-labels. Second, we introduce the Weighted Confidence Filter (WCF), which assigns higher scores to <sc> so that low-quality pseudo-labels are excluded without hurting <sc> prediction. Experimental results show that PL-SOT achieves 17.7%/12.8% average relative reductions in CER/UD-CER, with AliMeeting as the source domain and AISHELL-4 and MagicData-RAMC as the target domains.
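As a rough illustration of two ideas named in the abstract, the sketch below shows how an SOT target can be built by joining per-speaker transcripts in start-time order with <sc>, and how a confidence filter might weight <sc> differently from other tokens when deciding whether to keep a pseudo-label. The abstract does not give the WCF formula, so the weighted-average scoring, the function names, and the `sc_weight` and `threshold` values are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple

SC = "<sc>"  # speaker change token used in SOT transcripts


def serialize_sot(utterances: List[Tuple[float, str]]) -> List[str]:
    """Join per-speaker transcripts in start-time order, separated by <sc>.

    Each utterance is (start_time_seconds, whitespace-tokenized text).
    A Mandarin system would more likely use character tokens.
    """
    tokens: List[str] = []
    for i, (_, text) in enumerate(sorted(utterances, key=lambda u: u[0])):
        if i > 0:
            tokens.append(SC)
        tokens.extend(text.split())
    return tokens


def weighted_confidence(tokens: List[str], log_probs: List[float],
                        sc_weight: float = 2.0) -> float:
    """Weighted mean of token log-probabilities.

    Assumption: <sc> tokens get weight `sc_weight`, all others weight 1.
    This is a hypothetical scoring rule, not the formula from the paper.
    """
    weights = [sc_weight if t == SC else 1.0 for t in tokens]
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return total / sum(weights)


def keep_pseudo_label(tokens: List[str], log_probs: List[float],
                      threshold: float = -0.5, sc_weight: float = 2.0) -> bool:
    """Keep a pseudo-label only if its weighted confidence reaches the threshold."""
    return weighted_confidence(tokens, log_probs, sc_weight) >= threshold


if __name__ == "__main__":
    hyp = serialize_sot([(1.2, "fine thanks"), (0.0, "how are you")])
    print(hyp)
    scores = [-0.1, -0.2, -0.1, -0.9, -0.2, -0.1]  # made-up log-probs
    print(weighted_confidence(hyp, scores), keep_pseudo_label(hyp, scores))
```

In this toy rule, a poorly predicted <sc> lowers the utterance score more than an ordinary token would, so utterances with unreliable speaker change points are filtered out first. Whether the published WCF behaves this way is not stated in the abstract.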
Source journal
IEEE Signal Processing Letters (Engineering Technology; Engineering: Electrical & Electronic)
CiteScore: 7.40
Self-citation rate: 12.80%
Articles published: 339
Review time: 2.8 months
Journal description: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.