{"title":"Investigation of Modeling Units for Mandarin Speech Recognition Using Dfsmn-ctc-smbr","authors":"Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li","doi":"10.1109/icassp.2019.8683859","DOIUrl":null,"url":null,"abstract":"The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. The recent connectionist temporal classification (CTC) based acoustic models have more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable, we also propose a hybrid Character-Syllable modeling units by mixing high frequency Chinese characters and syllables. Experimental results show that DFSMN-CTC-sMBR models with all these types of modeling units can significantly outperform the well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units is the best choice for CTC based acoustic modeling for Mandarin speech recognition in our work since it can dramatically reduce substitution errors in recognition results. In a 20,000 hours Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable achieves a character error rate (CER) of 7.45% while performance of the well-trained DFSMN-CE-sMBR system is 9.49%.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"65 1","pages":"7085-7089"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"31","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icassp.2019.8683859","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 31
Abstract
The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. The recent connectionist temporal classification (CTC) based acoustic models have more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable, we also propose a hybrid Character-Syllable modeling units by mixing high frequency Chinese characters and syllables. Experimental results show that DFSMN-CTC-sMBR models with all these types of modeling units can significantly outperform the well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units is the best choice for CTC based acoustic modeling for Mandarin speech recognition in our work since it can dramatically reduce substitution errors in recognition results. In a 20,000 hours Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable achieves a character error rate (CER) of 7.45% while performance of the well-trained DFSMN-CE-sMBR system is 9.49%.