Multimodal N-best List Rescoring with Weakly Supervised Pre-training in Hybrid Speech Recognition

Yuanfeng Song, Xiaoling Huang, Xuefang Zhao, Di Jiang, Raymond Chi-Wing Wong

2021 IEEE International Conference on Data Mining (ICDM), December 2021
DOI: 10.1109/ICDM51629.2021.00167

Abstract
{"title":"混合语音识别中弱监督预训练的多模态N-best列表评分","authors":"Yuanfeng Song, Xiaoling Huang, Xuefang Zhao, Di Jiang, Raymond Chi-Wing Wong","doi":"10.1109/ICDM51629.2021.00167","DOIUrl":null,"url":null,"abstract":"N-best list rescoring, an essential step in hybrid automatic speech recognition (ASR), aims to re-evaluate the N-best hypothesis list decoded by the acoustic model (AM) and language model (LM), and selects the top-ranked hypotheses as the final ASR results. This paper explores the performance of neural rescoring models in scenarios where large-scale training labels are not available. We propose a weakly supervised neural rescoring method, WSNeuRescore, where a listwise multimodal neural rescoring model is pre-trained using labels automatically obtained without human annotators. Specifically, we employ the output of an unsupervised rescoring model, the weighted linear combination of the AM score and the LM score, as a weak supervision signal to pre-train the neural rescoring model. Our experimental evaluations on a public dataset validate that the pre-trained rescoring model based on weakly supervised data leads to an impressive performance. In the extreme scenario without any high-quality labeled data, it achieves up to an 11.90% WER reduction and a 15.56% NDCG@10 improvement over the baseline method in Kaldi, a well-known open-source toolkit in the ASR community.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal N-best List Rescoring with Weakly Supervised Pre-training in Hybrid Speech Recognition\",\"authors\":\"Yuanfeng Song, Xiaoling Huang, Xuefang Zhao, Di Jiang, Raymond Chi-Wing Wong\",\"doi\":\"10.1109/ICDM51629.2021.00167\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"N-best list rescoring, an essential step in hybrid automatic speech recognition (ASR), aims to re-evaluate the N-best hypothesis list decoded by the acoustic model (AM) and language model (LM), and selects the top-ranked hypotheses as the final ASR results. This paper explores the performance of neural rescoring models in scenarios where large-scale training labels are not available. We propose a weakly supervised neural rescoring method, WSNeuRescore, where a listwise multimodal neural rescoring model is pre-trained using labels automatically obtained without human annotators. Specifically, we employ the output of an unsupervised rescoring model, the weighted linear combination of the AM score and the LM score, as a weak supervision signal to pre-train the neural rescoring model. Our experimental evaluations on a public dataset validate that the pre-trained rescoring model based on weakly supervised data leads to an impressive performance. 
In the extreme scenario without any high-quality labeled data, it achieves up to an 11.90% WER reduction and a 15.56% NDCG@10 improvement over the baseline method in Kaldi, a well-known open-source toolkit in the ASR community.\",\"PeriodicalId\":320970,\"journal\":{\"name\":\"2021 IEEE International Conference on Data Mining (ICDM)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Conference on Data Mining (ICDM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDM51629.2021.00167\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM51629.2021.00167","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
n -最佳列表评分是混合自动语音识别(ASR)的重要步骤,其目的是对声学模型(AM)和语言模型(LM)解码的n -最佳假设列表进行重新评估,并选择排名靠前的假设作为最终的ASR结果。本文探讨了神经评分模型在没有大规模训练标签的情况下的性能。我们提出了一种弱监督神经评分方法WSNeuRescore,其中使用自动获得的标签预训练列表多模态神经评分模型,而无需人工注释器。具体而言,我们采用无监督评分模型的输出,即AM评分和LM评分的加权线性组合,作为弱监督信号来预训练神经评分模型。我们在公共数据集上的实验评估验证了基于弱监督数据的预训练评分模型带来了令人印象深刻的性能。在没有任何高质量标记数据的极端情况下,与Kaldi (ASR社区中著名的开源工具包)中的基线方法相比,它实现了高达11.90%的WER降低和15.56% NDCG@10改进。
Multimodal N-best List Rescoring with Weakly Supervised Pre-training in Hybrid Speech Recognition
N-best list rescoring, an essential step in hybrid automatic speech recognition (ASR), aims to re-evaluate the N-best hypothesis list decoded by the acoustic model (AM) and language model (LM), and to select the top-ranked hypotheses as the final ASR results. This paper explores the performance of neural rescoring models in scenarios where large-scale training labels are not available. We propose a weakly supervised neural rescoring method, WSNeuRescore, in which a listwise multimodal neural rescoring model is pre-trained using labels obtained automatically, without human annotators. Specifically, we employ the output of an unsupervised rescoring model, namely the weighted linear combination of the AM score and the LM score, as a weak supervision signal to pre-train the neural rescoring model. Our experimental evaluations on a public dataset validate that the rescoring model pre-trained on weakly supervised data achieves impressive performance. In the extreme scenario without any high-quality labeled data, it achieves up to an 11.90% WER reduction and a 15.56% NDCG@10 improvement over the baseline method in Kaldi, a well-known open-source toolkit in the ASR community.
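The abstract describes the weak supervision signal only at a high level: the N-best list is re-ranked by a weighted linear combination of the AM and LM scores, and that ranking is used as the pre-training label. Below is a minimal Python sketch of such an unsupervised combination rescorer; the Hypothesis fields, the lm_weight value, and the example scores are illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_score: float  # acoustic model log-likelihood (hypothetical values below)
    lm_score: float  # language model log-probability

def unsupervised_rescore(nbest, lm_weight=10.0):
    """Rank an N-best list by the weighted linear combination of the AM
    and LM scores; the resulting ranking can serve as the weak
    supervision signal for pre-training a neural rescoring model."""
    return sorted(nbest, key=lambda h: h.am_score + lm_weight * h.lm_score,
                  reverse=True)

# Hypothetical 3-best list for a single utterance.
nbest = [
    Hypothesis("i saw a cat", am_score=-120.4, lm_score=-8.1),
    Hypothesis("eye saw a cat", am_score=-119.9, lm_score=-14.7),
    Hypothesis("i saw a cap", am_score=-121.3, lm_score=-9.0),
]
ranked = unsupervised_rescore(nbest)
print(ranked[0].text)  # the top-ranked hypothesis becomes the ASR output
```

The interpolation weight plays the same role as the language-model weight swept during standard Kaldi rescoring; in practice it would be tuned on development data rather than fixed.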
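The reported gains are measured in WER (transcription quality of the top hypothesis) and NDCG@10 (quality of the ranking over the whole list, a natural metric for a listwise model). For reference, a standard NDCG@k implementation follows; it is not taken from the paper, and the relevance labels in the usage line are hypothetical (in practice they would be derived from per-hypothesis quality, e.g., inverse WER).

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the top-k items in ranked order."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the
    ideal ranking (relevances sorted in decreasing order)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of each hypothesis, listed in the order the rescorer ranked them.
print(ndcg_at_k([3, 1, 2, 0, 0]))  # ~0.97: a near-ideal ranking
```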