Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition

IF 3.2, Zone 2 (Engineering & Technology), JCR Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Signal Processing Letters Pub Date : 2024-10-29 DOI:10.1109/LSP.2024.3487795

Lin Zheng;Han Zhu;Sanli Tian;Qingwei Zhao;Ta Li

IEEE Signal Processing Letters, vol. 31, pp. 3119-3123. https://ieeexplore.ieee.org/document/10737652/

Citations: 0

Abstract

Serialized Output Training (SOT) has emerged as the mainstream approach to multi-talker overlapped speech recognition because of its simplicity. However, SOT suffers from cross-domain performance degradation, which hinders its application. Meanwhile, traditional domain adaptation methods may harm the accuracy of speaker change point prediction, evaluated by UD-CER, which is an important metric in SOT. To solve these issues, we propose Pseudo-Labeling based SOT (PL-SOT) for domain adaptation, which treats the speaker change token (<sc>) specially during training to increase the accuracy of speaker change point prediction. First, we improve the CTC loss by proposing the Weakening and Enhancing CTC (WE-CTC) loss, which weakens the learning of error-prone labels surrounding <sc> while enhancing the emission probability of <sc> by modifying the posteriors of the pseudo-labels. Second, we introduce the Weighted Confidence Filter (WCF), which assigns higher scores to <sc> so that low-quality pseudo-labels are excluded without hurting <sc> prediction. Experimental results show that PL-SOT achieves average relative reductions of 17.7% in CER and 12.8% in UD-CER, with AliMeeting as the source domain and AISHELL-4 and MagicData-RAMC as the target domains.
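The abstract does not give the WCF formula, so the following is only a minimal Python sketch of the general idea of a confidence filter that scores the speaker change token more favorably. The boosting rule (multiplying the <sc> confidence by a factor and capping it at 1.0), the mean-based utterance score, and the names weighted_confidence, filter_pseudo_labels, sc_boost and threshold are all assumptions made for illustration, not the authors' definition.

```python
from typing import List, Tuple

SC = "<sc>"  # speaker change token used in SOT transcripts

Hypothesis = Tuple[List[str], List[float]]  # (tokens, per-token confidences)


def weighted_confidence(tokens: List[str], confs: List[float],
                        sc_boost: float = 1.5) -> float:
    """Mean token confidence, with <sc> confidences scaled up (capped at 1.0).

    ASSUMPTION: the scaling rule is hypothetical; the paper's WCF may differ.
    """
    if len(tokens) != len(confs):
        raise ValueError("tokens and confs must have the same length")
    if not tokens:
        return 0.0
    scores = [min(1.0, c * sc_boost) if t == SC else c
              for t, c in zip(tokens, confs)]
    return sum(scores) / len(scores)


def filter_pseudo_labels(hyps: List[Hypothesis], threshold: float = 0.85,
                         sc_boost: float = 1.5) -> List[Hypothesis]:
    """Keep only hypotheses whose weighted confidence reaches the threshold."""
    return [(tokens, confs) for tokens, confs in hyps
            if weighted_confidence(tokens, confs, sc_boost) >= threshold]


# Toy pseudo-labels: the first has a low-confidence <sc>, the second is poor overall.
hyps = [
    (["ni", "hao", SC, "en"], [0.9, 0.9, 0.5, 0.9]),
    (["a", "b"], [0.4, 0.6]),
]
kept = filter_pseudo_labels(hyps, threshold=0.85)
print(len(kept), "of", len(hyps), "pseudo-labels kept")
```

In this toy setup, the first hypothesis would be dropped by a plain mean-confidence filter (sc_boost=1.0) only because of its uncertain <sc>, whereas the boosted score keeps it. That is one plausible reading of "without hurting <sc> prediction"; the threshold and boost values here are arbitrary.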




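Source journal

IEEE Signal Processing Letters (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.40

Self-citation rate: 12.80%

Articles published: 339

Review time: 2.8 months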

Journal description: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.