Austin Waters, Neeraj Gaur, Parisa Haghani, P. Moreno, Zhongdi Qu
{"title":"在多语言端到端语音识别中利用语言ID","authors":"Austin Waters, Neeraj Gaur, Parisa Haghani, P. Moreno, Zhongdi Qu","doi":"10.1109/ASRU46091.2019.9003870","DOIUrl":null,"url":null,"abstract":"Recent advances in end-to-end speech recognition have made it possible to build multilingual models, capable of recognizing speech in multiple languages. Multilingual models can outperform their monolingual counterparts, depending on the amount of training data and the relatedness of languages. However, in some cases, these models rely on having perfect knowledge of the language being spoken; that is, they expect to be provided with an external language ID that augments the input features or modulates internal layers of the network. In this paper, we introduce a novel technique for inferring the language ID in a streaming fashion using RNN-T, and a novel loss function that pressures the model to identify the language after as few frames as possible. The output of this streaming language-ID model is used in training and inference of a multilingual recognition model. We show the effectiveness of our approach through experiments on two sets of languages, one consisting of different dialects of Arabic, and the other consisting of Nordic languages, Finnish and Dutch.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"Leveraging Language ID in Multilingual End-to-End Speech Recognition\",\"authors\":\"Austin Waters, Neeraj Gaur, Parisa Haghani, P. Moreno, Zhongdi Qu\",\"doi\":\"10.1109/ASRU46091.2019.9003870\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent advances in end-to-end speech recognition have made it possible to build multilingual models, capable of recognizing speech in multiple languages. Multilingual models can outperform their monolingual counterparts, depending on the amount of training data and the relatedness of languages. However, in some cases, these models rely on having perfect knowledge of the language being spoken; that is, they expect to be provided with an external language ID that augments the input features or modulates internal layers of the network. In this paper, we introduce a novel technique for inferring the language ID in a streaming fashion using RNN-T, and a novel loss function that pressures the model to identify the language after as few frames as possible. The output of this streaming language-ID model is used in training and inference of a multilingual recognition model. We show the effectiveness of our approach through experiments on two sets of languages, one consisting of different dialects of Arabic, and the other consisting of Nordic languages, Finnish and Dutch.\",\"PeriodicalId\":150913,\"journal\":{\"name\":\"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASRU46091.2019.9003870\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9003870","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Leveraging Language ID in Multilingual End-to-End Speech Recognition
Recent advances in end-to-end speech recognition have made it possible to build multilingual models, capable of recognizing speech in multiple languages. Multilingual models can outperform their monolingual counterparts, depending on the amount of training data and the relatedness of languages. However, in some cases, these models rely on having perfect knowledge of the language being spoken; that is, they expect to be provided with an external language ID that augments the input features or modulates internal layers of the network. In this paper, we introduce a novel technique for inferring the language ID in a streaming fashion using RNN-T, and a novel loss function that pressures the model to identify the language after as few frames as possible. The output of this streaming language-ID model is used in training and inference of a multilingual recognition model. We show the effectiveness of our approach through experiments on two sets of languages, one consisting of different dialects of Arabic, and the other consisting of Nordic languages, Finnish and Dutch.