Title: Research of Automatic Speech Recognition of Asante-Twi Dialect For Translation
Authors: Adwoa Agyeiwaa Boakye-Yiadom, Mingwei Qin, Ren Jing
DOI: 10.1145/3501409.3501602
Published in: Proceedings of the 2021 5th International Conference on Electronic Information Technology and Computer Engineering
Publication date: 2021-10-22
Citations: 2
Abstract
This paper presents a new way of building Automatic Speech Recognition (ASR) systems for low-resourced dialects, using a small database of the Asante-Twi dialect. Three ASR systems with different features and methods were built and tested with the Kaldi toolkit. The first and second Asante-Twi ASR systems used the Mel-Frequency Cepstral Coefficient (MFCC) feature extraction method, each with different context-dependent parameters, while the third system used the Perceptual Linear Prediction (PLP) feature extraction method. To enhance the performance of the ASR systems, the feature extraction of all systems is augmented with Cepstral Mean and Variance Normalization (CMVN) and delta (Δ) dynamic features. In addition, the acoustic model unit of each ASR system, based on the GMM-HMM pattern classifier algorithm, is improved by training two context-dependent (triphone) models, one on top of the other, and both on top of context-independent (monophone) models, to deliver better performance. Word Error Rate (WER) is used as the metric for the recognition accuracy of the systems. With the correct parameter settings for the triphone models, the second ASR system achieved about a 50% reduction in WER for the first triphone model and about a 10% reduction in WER for the second triphone model, compared to the first ASR system. Decoding results show that the second ASR system was the most accurate of all the ASR systems, producing the lowest WER of 5.15%, obtained from its context-dependent triphone models. The third ASR system, using the same triphone parameters as the second, performed worst of all three. Thus, MFCCs are found to be the most suitable feature extraction technique when using noise-free data, with context-dependent acoustic models being the best method for GMM-HMM acoustic modeling on a limited amount of data.
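The CMVN and delta steps mentioned in the abstract are standard post-processing applied to MFCC or PLP frames: CMVN shifts each coefficient dimension to zero mean and unit variance over the utterance, and delta features capture frame-to-frame dynamics with a regression window. The sketch below is an illustrative pure-Python reimplementation of these two standard operations, not the Kaldi code the paper actually uses (Kaldi computes them in optimized C++ via `apply-cmvn` and `add-deltas`); the window size `n=2` matches the common default.

```python
import statistics

def cmvn(frames):
    """Per-utterance Cepstral Mean and Variance Normalization:
    each coefficient dimension is shifted to zero mean and scaled
    to unit variance across all frames of the utterance."""
    dims = len(frames[0])
    means = [statistics.fmean(f[d] for f in frames) for d in range(dims)]
    # Guard against a zero standard deviation (constant dimension).
    stds = [statistics.pstdev([f[d] for f in frames]) or 1.0 for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]

def deltas(frames, n=2):
    """First-order delta (dynamic) features with the standard regression
    formula: delta_t = sum_k k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with edge frames replicated at the utterance boundaries."""
    denom = 2 * sum(k * k for k in range(1, n + 1))
    T, dims = len(frames), len(frames[0])
    out = []
    for t in range(T):
        row = []
        for d in range(dims):
            num = sum(k * (frames[min(t + k, T - 1)][d] - frames[max(t - k, 0)][d])
                      for k in range(1, n + 1))
            row.append(num / denom)
        out.append(row)
    return out
```

In a pipeline like the paper's, the normalized static coefficients and their deltas (and often delta-deltas) are concatenated per frame before GMM-HMM training.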
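The WER metric used to compare the three systems is the word-level Levenshtein edit distance (substitutions + deletions + insertions) between the decoded hypothesis and the reference transcript, divided by the number of reference words. A minimal stdlib-only implementation for illustration (Kaldi reports this via its own scoring scripts; the example sentences here are hypothetical):

```python
def wer(reference, hypothesis):
    """Word Error Rate = (substitutions + deletions + insertions) / len(reference),
    computed by dynamic-programming edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped from a four-word reference -> WER = 1/4 = 25%.
print(wer("me din de kofi", "me din kofi"))  # -> 0.25
```

A system WER of 5.15%, as reported for the second ASR system, means roughly 5 word-level errors per 100 reference words over the test set.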