基于固定字符级滤波器的卷积神经网络哈萨克语语音识别模型的开发

IF 4.4 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data and Cognitive Computing Pub Date : 2023-07-20 DOI:10.3390/bdcc7030132

N. Kadyrbek, M. Mansurova, A. Shomanov, G. Makharova

{"title":"基于固定字符级滤波器的卷积神经网络哈萨克语语音识别模型的开发","authors":"N. Kadyrbek, M. Mansurova, A. Shomanov, G. Makharova","doi":"10.3390/bdcc7030132","DOIUrl":null,"url":null,"abstract":"This study is devoted to the transcription of human speech in the Kazakh language in dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality decoded audio corpus was collected, containing 554 h of data, giving an idea of the frequencies of letters and syllables, as well as demographic parameters such as the gender, age, and region of residence of native speakers. The corpus contains a universal vocabulary and serves as a valuable resource for the development of modules related to speech. Machine learning experiments were conducted using the DeepSpeech2 model, which includes a sequence-to-sequence architecture with an encoder, decoder, and attention mechanism. To increase the reliability of the model, filters initialized with symbol-level embeddings were introduced to reduce the dependence on accurate positioning on object maps. The training process included simultaneous preparation of convolutional filters for spectrograms and symbolic objects. The proposed approach, using a combination of supervised and unsupervised learning methods, resulted in a 66.7% reduction in the weight of the model while maintaining relative accuracy. The evaluation on the test sample showed a 7.6% lower character error rate (CER) compared to existing models, demonstrating its most modern characteristics. The proposed architecture provides deployment on platforms with limited resources. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and languages beyond Kazakh.","PeriodicalId":36397,"journal":{"name":"Big Data and Cognitive Computing","volume":" ","pages":""},"PeriodicalIF":4.4000,"publicationDate":"2023-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters\",\"authors\":\"N. Kadyrbek, M. Mansurova, A. Shomanov, G. Makharova\",\"doi\":\"10.3390/bdcc7030132\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study is devoted to the transcription of human speech in the Kazakh language in dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality decoded audio corpus was collected, containing 554 h of data, giving an idea of the frequencies of letters and syllables, as well as demographic parameters such as the gender, age, and region of residence of native speakers. The corpus contains a universal vocabulary and serves as a valuable resource for the development of modules related to speech. Machine learning experiments were conducted using the DeepSpeech2 model, which includes a sequence-to-sequence architecture with an encoder, decoder, and attention mechanism. To increase the reliability of the model, filters initialized with symbol-level embeddings were introduced to reduce the dependence on accurate positioning on object maps. The training process included simultaneous preparation of convolutional filters for spectrograms and symbolic objects. The proposed approach, using a combination of supervised and unsupervised learning methods, resulted in a 66.7% reduction in the weight of the model while maintaining relative accuracy. The evaluation on the test sample showed a 7.6% lower character error rate (CER) compared to existing models, demonstrating its most modern characteristics. The proposed architecture provides deployment on platforms with limited resources. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and languages beyond Kazakh.\",\"PeriodicalId\":36397,\"journal\":{\"name\":\"Big Data and Cognitive Computing\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2023-07-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Big Data and Cognitive Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/bdcc7030132\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data and Cognitive Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/bdcc7030132","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

本研究致力于在动态变化的条件下哈萨克语中人类语音的转录。它讨论了与哈萨克语语音结构相关的关键方面，收集转录音频语料库的技术考虑，以及深度神经网络用于语音建模。收集了一个高质量的解码音频语料库，包含554小时的数据，了解了字母和音节的频率，以及母语人士的性别、年龄和居住地区等人口统计参数。语料库包含通用词汇，是开发语音相关模块的宝贵资源。机器学习实验是使用DeepSpeech2模型进行的，该模型包括带有编码器、解码器和注意力机制的序列到序列架构。为了提高模型的可靠性，引入了用符号级嵌入初始化的滤波器，以减少对对象地图精确定位的依赖。训练过程包括同时为声谱图和符号对象准备卷积滤波器。所提出的方法结合了有监督和无监督的学习方法，在保持相对准确性的同时，使模型的权重降低了66.7%。对测试样本的评估显示，与现有模型相比，字符错误率（CER）降低了7.6%，展示了其最现代的特征。所提出的体系结构在资源有限的平台上进行部署。总的来说，这项研究提供了一个高质量的音频语料库，一个改进的语音识别模型，以及适用于哈萨克语以外的语音相关应用和语言的有希望的结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters

This study is devoted to the transcription of human speech in the Kazakh language in dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality decoded audio corpus was collected, containing 554 h of data, giving an idea of the frequencies of letters and syllables, as well as demographic parameters such as the gender, age, and region of residence of native speakers. The corpus contains a universal vocabulary and serves as a valuable resource for the development of modules related to speech. Machine learning experiments were conducted using the DeepSpeech2 model, which includes a sequence-to-sequence architecture with an encoder, decoder, and attention mechanism. To increase the reliability of the model, filters initialized with symbol-level embeddings were introduced to reduce the dependence on accurate positioning on object maps. The training process included simultaneous preparation of convolutional filters for spectrograms and symbolic objects. The proposed approach, using a combination of supervised and unsupervised learning methods, resulted in a 66.7% reduction in the weight of the model while maintaining relative accuracy. The evaluation on the test sample showed a 7.6% lower character error rate (CER) compared to existing models, demonstrating its most modern characteristics. The proposed architecture provides deployment on platforms with limited resources. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and languages beyond Kazakh.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊