Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2019-05-12. DOI: 10.1109/icassp.2019.8683859

Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li

Pages: 7085-7089

Citations: 31

Abstract

The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. Recent connectionist temporal classification (CTC) based acoustic models allow a wider choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and syllables, we also propose hybrid Character-Syllable modeling units that mix high-frequency Chinese characters with syllables. Experimental results show that DFSMN-CTC-sMBR models with all of these types of modeling units significantly outperform well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units are the best choice for CTC-based acoustic modeling for Mandarin speech recognition in our work, since they dramatically reduce substitution errors in the recognition results. In a 20,000-hour Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable units achieves a character error rate (CER) of 7.45%, while the well-trained DFSMN-CE-sMBR system achieves 9.49%.
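The abstract does not say how the hybrid Character-Syllable inventory is built, so the following is only a minimal sketch of one way to read "mixing high frequency Chinese characters and syllables": keep the most frequent characters as their own units and back off to the syllable for everything else. The toy corpus, the pinyin lexicon `LEXICON`, the cutoff `num_chars`, the `<unk>` fallback, and both function names are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# Hypothetical character-to-syllable lexicon (tonal pinyin), for illustration only.
LEXICON = {
    "我": "wo3", "们": "men5", "是": "shi4", "他": "ta1",
    "她": "ta1", "学": "xue2", "生": "sheng1", "医": "yi1",
}

# Hypothetical training transcripts used to estimate character frequency.
CORPUS = ["我们是学生", "他是医生", "她是学生"]


def select_frequent_chars(corpus, lexicon, num_chars):
    """Return the num_chars most frequent characters that the lexicon covers."""
    counts = Counter(ch for sentence in corpus for ch in sentence if ch in lexicon)
    return {ch for ch, _ in counts.most_common(num_chars)}


def to_hybrid_units(sentence, frequent_chars, lexicon):
    """Map a transcript to hybrid units: frequent characters stay as
    characters, every other character is replaced by its syllable."""
    units = []
    for ch in sentence:
        if ch in frequent_chars:
            units.append(ch)
        else:
            # Assumed fallback for characters missing from the lexicon.
            units.append(lexicon.get(ch, "<unk>"))
    return units


frequent = select_frequent_chars(CORPUS, LEXICON, num_chars=3)
print(to_hybrid_units("他是学生", frequent, LEXICON))
# prints ['ta1', '是', '学', '生']
```

In this toy setup the rare homophones 他 and 她 share the syllable unit `ta1`, while the frequent characters 是, 学 and 生 get dedicated output units. In a real system the cutoff would presumably be tuned on the training data rather than fixed by hand.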

