Self-Adaptive Multilingual ASR Rescoring with Language Identification and Unified Language Model

Zhuo Gong, D. Saito, Longfei Yang, T. Shinozaki, Sheng Li, H. Kawai, N. Minematsu
{"title":"Self-Adaptive Multilingual ASR Rescoring with Language Identification and Unified Language Model","authors":"Zhuo Gong, D. Saito, Longfei Yang, T. Shinozaki, Sheng Li, H. Kawai, N. Minematsu","doi":"10.21437/odyssey.2022-58","DOIUrl":null,"url":null,"abstract":"Language Models (LM) can be used in automatic speech recognition (ASR) rescoring to select the hypothesis with the fewest errors. While in multilingual ASR, multiple LMs might be used based on language identification (LID) given by the multilingual ASR outputs. However, in the traditional shallow fusion method, a static LM weight is determined by a development set. This static weight might not fulfill the situations of all languages in test data. And for multiple LMs, different weight needs to be searched for each LM. Instead, A unified multilingual LM will receive a LID token at the beginning of its auto-regressive predicting to decide which language to decode, so that merely one weight is necessary for LM rescoring. Then, we propose a multilingual ASR rescoring method which dynamically tunes the LM weight during decoding to optimize the balance between the end-to-end (E2E) multilingual ASR model and the LM according to the LM’s entropy and logits score as model confidence metrics. With this method, resources for search the best hyperparameter LM weight can also be saved. The experiments are mainly conducted on Common voice and Voxforge corpora. 
The results show that this method can reach the performance of the best static LM weight and even defeat it in several languages with no hyperparameter to be tuned and nearly zero overhead.","PeriodicalId":315750,"journal":{"name":"The Speaker and Language Recognition Workshop","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Speaker and Language Recognition Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/odyssey.2022-58","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Language models (LMs) can be used in automatic speech recognition (ASR) rescoring to select the hypothesis with the fewest errors. In multilingual ASR, multiple LMs may be used, selected according to the language identification (LID) produced by the multilingual ASR output. However, in the traditional shallow fusion method, a static LM weight is determined on a development set. This static weight may not suit every language in the test data, and with multiple LMs a separate weight must be searched for each one. Instead, a unified multilingual LM receives a LID token at the start of its auto-regressive prediction to decide which language to decode, so that only one weight is needed for LM rescoring. We then propose a multilingual ASR rescoring method that dynamically tunes the LM weight during decoding, using the LM's entropy and logit scores as model confidence metrics to balance the end-to-end (E2E) multilingual ASR model against the LM. This method also saves the resources otherwise spent searching for the best LM weight hyperparameter. The experiments are conducted mainly on the Common Voice and VoxForge corpora. The results show that this method matches the performance of the best static LM weight, and even surpasses it in several languages, with no hyperparameter to tune and nearly zero overhead.
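The core idea — shallow-fusion rescoring with a per-step LM weight driven by the LM's entropy — can be illustrated with a minimal sketch. This is not the authors' implementation; the function names (`dynamic_lm_weight`, `rescore`), the base weight `max_weight`, and the specific entropy-to-confidence mapping are all illustrative assumptions. It only shows the general mechanism the abstract describes: when the LM is confident (low entropy over its next-token logits), its weight in the fused score is raised; when it is uncertain, the weight shrinks toward zero and the E2E ASR score dominates.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dynamic_lm_weight(lm_logits, max_weight=0.5):
    """Map LM entropy to a per-step fusion weight (illustrative scheme).

    Entropy is normalized by its maximum (a uniform distribution), so
    confidence lies in [0, 1]: a peaked LM distribution yields a weight
    near max_weight, a flat one yields a weight near 0.
    """
    probs = softmax(lm_logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(lm_logits))
    confidence = 1.0 - entropy / max_entropy
    return max_weight * confidence

def rescore(asr_score, lm_step_logits, token_ids):
    """Shallow-fusion-style hypothesis rescoring with a dynamic LM weight.

    asr_score: log-probability of the hypothesis under the E2E ASR model.
    lm_step_logits: per-step LM logits (the unified LM would be primed
    with a LID token before producing these).
    token_ids: the hypothesis tokens, one per LM step.
    """
    total = asr_score
    for logits, tok in zip(lm_step_logits, token_ids):
        lam = dynamic_lm_weight(logits)
        log_probs = [math.log(p) for p in softmax(logits)]
        total += lam * log_probs[tok]
    return total
```

Because the weight is recomputed from the LM distribution at every decoding step, no per-language grid search over static weights is needed, which is the source of the "no hyperparameter to tune" claim.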