Improving Phonetic Recognition with Sequence-length Standardized MFCC Features and Deep Bi-Directional LSTM

Toan Pham Van, Hau Nguyen Thanh, Ta Minh Thanh
Published in: 2018 5th NAFOSTED Conference on Information and Computer Science (NICS), 2018-11-01
DOI: 10.1109/NICS.2018.8606886 (https://doi.org/10.1109/NICS.2018.8606886)
Citations: 1

Abstract

Phonetic recognition is one of the most challenging problems in speech analysis. Its applications include dialect identification [1], mispronunciation detection [2], and spoken document retrieval [3]. Existing approaches improve feature selection on the input speech [4], apply deep learning techniques [5] [6] [7], or combine both [8]. Because phonetic data is sequential, architectures based on recurrent neural networks (RNNs) are an appropriate choice [9], and they become more powerful when combined with improved feature selection on the input data. In our approach, we combine Mel Frequency Cepstral Coefficients (MFCC) with sequence-length standardization to represent the acoustic features of speech, and we use several RNN models for phonetic classification. Our experiments are conducted on the Texas Instruments/Massachusetts Institute of Technology (TIMIT) [10] phone recognition dataset. Notably, our data processing and feature selection method gives consistently better results than other studies that use the same neural network model. We currently achieve the lowest test error rate (13.05%) using a bidirectional LSTM, which is the best result on the TIMIT dataset, a reduction of about 3.5% over the previous best result [5] [6].
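
The abstract does not say how the sequence length of the MFCC features is standardized. As a minimal sketch, assuming the common approach of truncating long utterances and zero-padding short ones to a fixed number of frames, the idea could look like the following. The function name `standardize_length`, the `pad_value` parameter, and the toy two-coefficient frames are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Frame = List[float]  # one MFCC vector for one audio frame


def standardize_length(frames: List[Frame], target_len: int,
                       pad_value: float = 0.0) -> List[Frame]:
    """Return exactly `target_len` frames: truncate long inputs, pad short ones.

    Assumption: padding with a constant and truncating at the end is only one
    possible way to standardize sequence length; the paper may use another.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if not frames:
        raise ValueError("frames must contain at least one MFCC vector")
    n_coeffs = len(frames[0])
    # Keep at most target_len frames (copying so the input is not mutated).
    clipped = [list(f) for f in frames[:target_len]]
    # Append constant-valued frames until the target length is reached.
    padding = [[pad_value] * n_coeffs for _ in range(target_len - len(clipped))]
    return clipped + padding


# Toy input: 2 frames and 5 frames, each with 2 coefficients.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[float(i), float(i)] for i in range(5)]

print(standardize_length(short_utt, 3))  # padded to 3 frames
print(standardize_length(long_utt, 3))   # truncated to 3 frames
```

Fixed-length inputs of this kind can be batched directly into a recurrent model such as a bidirectional LSTM; whether padded frames are masked during training is a separate design choice that the abstract does not address.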