Acoustic-to-word model without OOV

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-11-28 | DOI: 10.1109/ASRU.2017.8268924

Jinyu Li, Guoli Ye, Rui Zhao, J. Droppo, Y. Gong


Citations: 38

Abstract

Recently, the acoustic-to-word model based on the Connectionist Temporal Classification (CTC) criterion was shown to be a natural end-to-end model that directly targets words as output units. However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue: it can model only a limited number of words in the output layer and maps all remaining words to an OOV output node. As a result, such a word-based CTC model can recognize only the frequent words modeled by the network output nodes, and it cannot easily handle hot-words that emerge after the model is trained. In this study, we improve the acoustic-to-word model with a hybrid CTC model that predicts both words and characters at the same time. With a shared-hidden-layer structure and modular design, the alignments of words generated from the word-based CTC and the character-based CTC are synchronized. Whenever the acoustic-to-word model emits an OOV token, we back off that OOV segment to the word output generated from the character-based CTC, thereby solving the OOV or hot-words issue. Evaluated on a Microsoft Cortana voice assistant task, the proposed model reduces the errors introduced by the OOV output token in the acoustic-to-word model by 30%.
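
The OOV back-off described in the abstract can be illustrated with a toy sketch. It assumes greedy per-frame labels from two time-synchronized CTC heads (one over words, one over characters). The rule used here to match an OOV segment to a character-decoded word (largest frame overlap), the label names (`<b>` for blank, `<OOV>`, a space character as word separator), and all function names are assumptions made for illustration; the abstract does not specify them.

```python
BLANK, OOV, SPACE = "<b>", "<OOV>", " "  # hypothetical label names


def collapse_with_spans(frames, blank=BLANK):
    """Greedy CTC collapse: merge repeated labels, drop blanks,
    and keep the (label, start_frame, end_frame) span of each emission."""
    spans = []
    prev = blank
    for t, lab in enumerate(frames):
        if lab != blank:
            if lab != prev:
                spans.append([lab, t, t + 1])   # new emission starts here
            else:
                spans[-1][2] = t + 1            # repeated label extends the span
        prev = lab
    return [tuple(s) for s in spans]


def char_words(char_frames):
    """Group collapsed character emissions into (word, start, end) spans,
    splitting on the space label."""
    words, cur, start, end = [], "", None, None
    for lab, s, e in collapse_with_spans(char_frames):
        if lab == SPACE:
            if cur:
                words.append((cur, start, end))
            cur, start, end = "", None, None
        else:
            if not cur:
                start = s
            cur += lab
            end = e
    if cur:
        words.append((cur, start, end))
    return words


def overlap(a_start, a_end, b_start, b_end):
    """Number of frames shared by two half-open frame intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def hybrid_decode(word_frames, char_frames):
    """Decode with the word head; replace each OOV emission by the
    character-decoded word that overlaps it most in time (assumed rule).
    An OOV with no overlapping character word is left as <OOV>."""
    cwords = char_words(char_frames)
    out = []
    for lab, s, e in collapse_with_spans(word_frames):
        if lab == OOV and cwords:
            best = max(cwords, key=lambda w: overlap(s, e, w[1], w[2]))
            if overlap(s, e, best[1], best[2]) > 0:
                lab = best[0]
        out.append(lab)
    return out


# Toy example: the word head knows "text" but emits <OOV> for the second word.
word_frames = ["<b>", "text", "<b>", "<b>", "<b>", "<b>", "<OOV>", "<b>", "<b>"]
char_frames = ["t", "e", "x", "t", " ", "z", "o", "e", "<b>"]
print(hybrid_decode(word_frames, char_frames))  # ['text', 'zoe']
```

In-vocabulary words are taken from the word head unchanged; the character head is consulted only for OOV segments, which is why the two heads need synchronized alignments.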



