A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition

E. McDermott, H. Sak, Ehsan Variani
{"title":"A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition","authors":"E. McDermott, H. Sak, Ehsan Variani","doi":"10.1109/ASRU46091.2019.9003790","DOIUrl":null,"url":null,"abstract":"This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target domain RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the target domain, in a manner directly analogous to the classic hybrid model for ASR based on Deep Neural Networks (DNNs) or LSTMs in the Hidden Markov Model (HMM) framework (Bourlard & Morgan, 1994). The proposed approach is evaluated in cross-domain and limited-data scenarios, for which a significant amount of target domain text data is used for LM training, but only limited (or no) {audio, transcript} training data pairs are used to train the RNN-T. Specifically, an RNN-T model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data. The Density Ratio method was found to consistently outperform the dominant approach to LM and end-to-end ASR integration, Shallow Fusion.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"94","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9003790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 94

Abstract

This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target domain RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the target domain, in a manner directly analogous to the classic hybrid model for ASR based on Deep Neural Networks (DNNs) or LSTMs in the Hidden Markov Model (HMM) framework (Bourlard & Morgan, 1994). The proposed approach is evaluated in cross-domain and limited-data scenarios, for which a significant amount of target domain text data is used for LM training, but only limited (or no) {audio, transcript} training data pairs are used to train the RNN-T. Specifically, an RNN-T model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data. The Density Ratio method was found to consistently outperform the dominant approach to LM and end-to-end ASR integration, Shallow Fusion.
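To make the mechanism concrete, the following is a minimal sketch of the density ratio scoring rule implied by the abstract; the notation (p_ψ for the source-domain LM, p_τ for the target-domain LM, λ_ψ and λ_τ for their weights) is ours and may differ from the paper's. By Bayes' Rule, p_τ(y|x) ∝ p(x|y) p_τ(y), and the likelihood p(x|y) can be estimated from the source-domain RNN-T posterior as p_ψ(y|x) / p_ψ(y), up to a factor p(x) that is constant across hypotheses y. Scoring a hypothesis y during beam search then takes the form

\[
\mathrm{Score}(y) \;=\; \log p_{\psi}(y \mid x) \;-\; \lambda_{\psi} \log p_{\psi}(y) \;+\; \lambda_{\tau} \log p_{\tau}(y).
\]

With \(\lambda_{\psi} = \lambda_{\tau} = 1\) this is exact Bayes' Rule up to an additive constant \(\log p(x)\); in practice the weights are tuned on held-out data, and Shallow Fusion is recovered as the special case \(\lambda_{\psi} = 0\), i.e. adding the target LM without subtracting the source-domain LM.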