Speech Recognition via CTC-CNN Model

Impact Factor: 1.7

Computers, Materials & Continua; Pub Date: 2023-01-01; DOI: 10.32604/cmc.2023.040024

Wen-Tsai Sung, Hao-Wei Kang, Sung-Jung Hsiao


Citations: 0

Speech Recognition via CTC-CNN Model

In a speech recognition system, the acoustic model is an important underlying component, and its accuracy directly affects the performance of the entire system. This paper describes the construction and training of the acoustic model in detail and studies the connectionist temporal classification (CTC) algorithm, which plays an important role in the end-to-end framework. It then builds an acoustic model that combines a convolutional neural network (CNN) with CTC to improve the accuracy of speech recognition. The study uses a sound sensor, the ReSpeaker Mic Array v2.0.1, to convert the collected speech signals into text or corresponding speech signals, with the aim of improving communication and reducing noise and hardware interference. The baseline acoustic model suffers from long training time, a high error rate, and a certain degree of overfitting. The model is refined by iteratively redesigning and tuning the relevant parameters, and the best-performing model is then selected according to the evaluation metrics; it reduces the error rate to about 18%. Finally, comparative experiments on the choice of acoustic feature parameters, the choice of modeling units, and the speaker's speech rate further verify the strong performance of the CTCCNN_5 + BN + Residual model structure. To train and verify the CTC-CNN baseline acoustic model, the study uses the THCHS-30 and ST-CMDS speech data sets as training data; after 54 epochs of training, the word error rate of the acoustic model is 31% on the training set and stabilizes at about 43% on the test set. The experiments also consider ambient noise. At a noise level of 80–90 dB the accuracy is 88.18%, the worst among all levels tested, whereas at 40–60 dB it reaches 97.33% because there is less noise pollution.
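As an illustration of the two ideas the abstract leans on, the sketch below shows how a CTC output path is usually collapsed into a label sequence and how an error rate can be scored against a reference. It is a minimal sketch, not the authors' implementation: it assumes greedy (best-path) decoding, which the abstract does not specify, and it assumes the blank symbol has index 0. The function names and the label IDs in the example are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Collapse a per-frame best-label path into an output label sequence.

    Standard CTC rule: merge consecutive repeats first, then drop blanks.
    The blank index (0) is an assumption made for this sketch.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical per-frame argmax labels; 0 is the assumed blank index.
    frames = [0, 3, 3, 0, 3, 5, 5, 0, 0, 7]
    hypothesis = ctc_greedy_collapse(frames)
    reference = [3, 3, 5, 8]
    print(hypothesis, error_rate(reference, hypothesis))
```

The blank between the two runs of label 3 is what lets CTC emit the same label twice in a row; without it the repeats would merge into one. The same `error_rate` function works on word, syllable, or character sequences, so the unit it measures depends on the modeling unit that is chosen.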
