Speaker and Language Aware Training for End-to-End ASR

Shubham Bansal, Karan Malhotra, Sriram Ganapathy
{"title":"Speaker and Language Aware Training for End-to-End ASR","authors":"Shubham Bansal, Karan Malhotra, Sriram Ganapathy","doi":"10.1109/ASRU46091.2019.9004000","DOIUrl":null,"url":null,"abstract":"The end-to-end (E2E) approach to automatic speech recognition (ASR) is a simplified and an elegant approach where a single deep neural network model directly converts the acoustic feature sequence to the text sequence. The current approach to end-to-end ASR uses the neural network model (trained with sequence loss) along with an external character/word based language model (LM) in a decoding pass to output the text sequence. In this work, we propose a new objective function for end-to-end ASR training where the LM score is explicitly introduced in the attention model loss function without any additional training parameters. In this manner, the neural network is made LM aware and this simplifies the model training process. We also propose to incorporate an attention based sequence summary feature in the ASR model which allows the system to be speaker aware. With several E2E ASR experiments on TED-LIUM, WSJ and Librispeech datasets, we show that the proposed speaker and LM aware training improves the ASR performance significantly over the state-of-art E2E approaches. We achieve the best published results reported for WSJ dataset.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9004000","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

The end-to-end (E2E) approach to automatic speech recognition (ASR) is a simplified and elegant approach in which a single deep neural network model directly converts the acoustic feature sequence to the text sequence. The current approach to end-to-end ASR uses the neural network model (trained with a sequence loss) along with an external character/word-based language model (LM) in a decoding pass to output the text sequence. In this work, we propose a new objective function for end-to-end ASR training in which the LM score is explicitly introduced into the attention model loss function without any additional training parameters. In this manner, the neural network is made LM aware, which simplifies the model training process. We also propose to incorporate an attention-based sequence summary feature in the ASR model, which allows the system to be speaker aware. With several E2E ASR experiments on the TED-LIUM, WSJ and Librispeech datasets, we show that the proposed speaker and LM aware training improves ASR performance significantly over state-of-the-art E2E approaches. We achieve the best published results reported for the WSJ dataset.
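
The abstract describes two modifications to a standard attention-based E2E ASR system: an LM-aware objective, in which the external LM score enters the attention loss without adding trainable parameters, and a speaker-aware, attention-based sequence summary feature. The sketch below is a minimal illustration written from the abstract alone, not the authors' implementation; the exact formulation in the paper may differ. It assumes a frozen external LM that supplies per-token log-probabilities, a training-time shallow-fusion-style interpolation with weight `alpha`, and hypothetical tensor shapes as noted in the docstrings.

```python
# Sketch only: a plausible reading of the abstract, not the paper's exact objective.
import torch
import torch.nn.functional as F


def lm_aware_ce_loss(decoder_logits, lm_log_probs, targets, alpha=0.3, pad_id=0):
    """LM-aware cross-entropy (sketch).

    decoder_logits: (B, T, V) raw scores from the attention decoder.
    lm_log_probs:   (B, T, V) per-token log-probabilities from a frozen external LM.
    targets:        (B, T)    reference token ids, padded with pad_id.
    """
    asr_log_probs = F.log_softmax(decoder_logits, dim=-1)
    fused = asr_log_probs + alpha * lm_log_probs        # LM score enters the objective
    fused = F.log_softmax(fused, dim=-1)                # renormalise the fused scores
    # nll_loss expects (N, C, d1) inputs and (N, d1) targets
    return F.nll_loss(fused.transpose(1, 2), targets,
                      ignore_index=pad_id, reduction="mean")


def sequence_summary(encoder_out, scoring_vector):
    """Attention-based sequence summary (sketch of the speaker-aware feature).

    encoder_out:    (B, T, D) encoder hidden states.
    scoring_vector: (D,)      learned vector that scores each frame.
    Returns a (B, D) utterance-level summary that can be concatenated to each frame.
    """
    scores = torch.matmul(encoder_out, scoring_vector)      # (B, T)
    weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # (B, T, 1)
    return (weights * encoder_out).sum(dim=1)               # (B, D)
```

In this reading, the LM contributes only through its fixed log-probabilities, so the objective introduces no trainable parameters beyond the existing decoder; the sequence summary, by contrast, uses a small learned scoring vector to pool encoder frames into a per-utterance vector that can make the model speaker aware when appended to each frame.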