End-to-End Model Based on RNN-T for Kazakh Speech Recognition

2021 3rd International Conference on Computer Communication and the Internet (ICCCI), Pub Date: 2021-06-25, DOI: 10.1109/ICCCI51764.2021.9486811

O. Mamyrbayev, Dina Oralbekova, A. Kydyrbekova, Tolganay Turdalykyzy, A. Bekarystankyzy


Citations: 7

Abstract



Automatic speech recognition is a rapidly developing area of machine learning. The most popular speech recognition systems today are end-to-end systems, especially online end-to-end models, which directly output a sequence of words in real time as the input audio arrives. Streaming speech recognition feeds an audio stream into speech-to-text conversion and returns the recognition results in real time as the audio is processed. This article discusses and implements a popular RNN-T-based model for recognizing Kazakh speech. It also analyzes related work on Kazakh speech recognition based on the CTC model. The findings show that an RNN-T-based model can work well without additional components such as a language model, and that it gave the best outcome on our dataset. The system reached a character error rate (CER) of 10.6%, which is the best result among other end-to-end systems for recognizing Kazakh speech.
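For readers unfamiliar with the metric, CER is conventionally defined as the character-level edit distance between the recognized text and the reference transcript, divided by the number of reference characters. The abstract does not describe the authors' scoring procedure, so the following is only a minimal sketch of that standard definition; the function names `edit_distance` and `cer` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level character error rate: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


# Hypothetical example: one substituted character in a five-character word.
print(cer(["сәлем"], ["салем"]))  # 0.2
```

Under this definition, a CER of 10.6% corresponds to roughly one character-level error per nine to ten reference characters, aggregated over the test set.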
