{"title":"Syllable-Based Acoustic Modeling With Lattice-Free MMI for Mandarin Speech Recognition","authors":"Jie Li, Zhiyun Fan, Xiaorui Wang, Yan Li","doi":"10.1109/ISCSLP49672.2021.9362050","DOIUrl":null,"url":null,"abstract":"Most automatic speech recognition (ASR) systems in past decades have used context-dependent (CD) phones as the fundamental acoustic units. However, these phone-based approaches lack an easy and efficient way for modeling long-term temporal dependencies. Compared with phone units, syllables span a longer time, typically several phones, thereby having more stable acoustic realizations. In this work, we aim to train a syllable-based acoustic model for Mandarin ASR with lattice-free maximum mutual information (LF-MMI) criterion. We expect that, the combination of longer linguistic units, the RNN-based model structure and the sequence-level objective function, can result in better modeling of long-term temporal acoustic variations. We make multiple modifications to improve the performance of syllable-based AM and benchmark our models on two large-scale databases. Experimental results show that the proposed syllable-based AM performs much better than the CD phone-based baseline, especially on noisy test sets, with faster decoding speed.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCSLP49672.2021.9362050","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Most automatic speech recognition (ASR) systems of the past decades have used context-dependent (CD) phones as the fundamental acoustic units. However, these phone-based approaches lack an easy and efficient way to model long-term temporal dependencies. Compared with phone units, syllables span a longer duration, typically several phones, and therefore have more stable acoustic realizations. In this work, we aim to train a syllable-based acoustic model for Mandarin ASR with the lattice-free maximum mutual information (LF-MMI) criterion. We expect that the combination of longer linguistic units, an RNN-based model structure, and a sequence-level objective function can result in better modeling of long-term temporal acoustic variations. We make multiple modifications to improve the performance of the syllable-based acoustic model (AM) and benchmark our models on two large-scale databases. Experimental results show that the proposed syllable-based AM performs substantially better than the CD phone-based baseline, especially on noisy test sets, while also decoding faster.
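For context, the MMI criterion referenced in the abstract is standard in the sequence-discriminative training literature; the notation below is the conventional form, not reproduced from the paper. For training utterances $u$ with acoustic features $\mathbf{X}_u$ and reference transcriptions $W_u$, MMI maximizes the log-posterior of the references against all competing hypotheses:

$$
\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log \frac{P(\mathbf{X}_u \mid \mathbb{M}_{W_u})\, P(W_u)}{\sum_{W'} P(\mathbf{X}_u \mid \mathbb{M}_{W'})\, P(W')}
$$

where $\mathbb{M}_{W}$ is the HMM corresponding to word sequence $W$ and $P(W)$ is the language-model prior. In the lattice-free variant, the denominator sum over competitors $W'$ is computed exactly by forward-backward over a single shared denominator graph (here built over syllables rather than CD phones), instead of being approximated with per-utterance lattices.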