Title: Research of Automatic Speech Recognition of Asante-Twi Dialect For Translation
Authors: Adwoa Agyeiwaa Boakye-Yiadom, Mingwei Qin, Ren Jing
DOI: 10.1145/3501409.3501602
Published in: Proceedings of the 2021 5th International Conference on Electronic Information Technology and Computer Engineering
Publication date: 2021-10-22
Citations: 2
Abstract
This paper presents a new way of building Automatic Speech Recognition (ASR) systems for low-resourced dialects, using a small database of the Asante-Twi dialect. Three ASR systems with different features and methods were built and tested with the Kaldi toolkit. The first and second Asante-Twi ASR systems used the Mel-Frequency Cepstral Coefficient (MFCC) feature extraction method, each with different context-dependent parameters, while the third system used the Perceptual Linear Prediction (PLP) feature extraction method. To enhance the performance of the ASR systems, the feature extraction of all systems is augmented with Cepstral Mean and Variance Normalization (CMVN) and delta (Δ) dynamic features. In addition, the acoustic model unit of each ASR system, based on the GMM-HMM pattern classifier algorithm, is improved by training two context-dependent (triphone) models, one on top of the other, and both on top of context-independent (monophone) models, to deliver better performance. Word Error Rate (WER) is used as the metric for the recognition accuracy of the systems. With the correct parameter settings for the triphone models, the second ASR system achieved about a 50% reduction in WER for the first triphone model and about a 10% reduction in WER for the second triphone model, compared to the first ASR system. Decoding results show that the second ASR system was the most accurate of all the ASR systems, producing the lowest WER of 5.15%, obtained from its context-dependent triphone models. The third ASR system, using the same triphone parameters as the second, performed worst of all three. Thus, MFCCs are found to be the most suitable feature extraction technique when using noise-free data, with context-dependent acoustic models being the best method for GMM-HMM acoustic modeling on a limited amount of data.
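The CMVN and delta steps mentioned in the abstract are standard post-processing applied to MFCC or PLP frames: CMVN shifts each coefficient dimension to zero mean and unit variance over the utterance, and delta features capture frame-to-frame dynamics with a regression window. The sketch below is an illustrative pure-Python reimplementation of these two standard operations, not the Kaldi code the paper actually uses (Kaldi computes them in optimized C++ via `apply-cmvn` and `add-deltas`); the window size `n=2` matches the common default.

```python
import statistics

def cmvn(frames):
    """Per-utterance Cepstral Mean and Variance Normalization:
    each coefficient dimension is shifted to zero mean and scaled
    to unit variance across all frames of the utterance."""
    dims = len(frames[0])
    means = [statistics.fmean(f[d] for f in frames) for d in range(dims)]
    # Guard against a zero standard deviation (constant dimension).
    stds = [statistics.pstdev([f[d] for f in frames]) or 1.0 for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]

def deltas(frames, n=2):
    """First-order delta (dynamic) features with the standard regression
    formula: delta_t = sum_k k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with edge frames replicated at the utterance boundaries."""
    denom = 2 * sum(k * k for k in range(1, n + 1))
    T, dims = len(frames), len(frames[0])
    out = []
    for t in range(T):
        row = []
        for d in range(dims):
            num = sum(k * (frames[min(t + k, T - 1)][d] - frames[max(t - k, 0)][d])
                      for k in range(1, n + 1))
            row.append(num / denom)
        out.append(row)
    return out
```

In a pipeline like the paper's, the normalized static coefficients and their deltas (and often delta-deltas) are concatenated per frame before GMM-HMM training.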
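The WER metric used to compare the three systems is the word-level Levenshtein edit distance (substitutions + deletions + insertions) between the decoded hypothesis and the reference transcript, divided by the number of reference words. A minimal stdlib-only implementation for illustration (Kaldi reports this via its own scoring scripts; the example sentences here are hypothetical):

```python
def wer(reference, hypothesis):
    """Word Error Rate = (substitutions + deletions + insertions) / len(reference),
    computed by dynamic-programming edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped from a four-word reference -> WER = 1/4 = 25%.
print(wer("me din de kofi", "me din kofi"))  # -> 0.25
```

A system WER of 5.15%, as reported for the second ASR system, means roughly 5 word-level errors per 100 reference words over the test set.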