O. Mamyrbayev, Dina Oralbekova, A. Kydyrbekova, Tolganay Turdalykyzy, A. Bekarystankyzy
2021 3rd International Conference on Computer Communication and the Internet (ICCCI), published 2021-06-25. DOI: https://doi.org/10.1109/ICCCI51764.2021.9486811
End-to-End Model Based on RNN-T for Kazakh Speech Recognition
Automatic speech recognition is a rapidly developing area of machine learning. The most popular speech recognition systems today are end-to-end systems, especially online end-to-end models, which output a sequence of words directly from the incoming audio in real time. Streaming speech recognition passes the audio stream to speech-to-text conversion and returns the recognition results in real time as the audio is processed. This article discusses and implements a popular RNN-T-based model for recognizing Kazakh speech. It also reviews prior work on Kazakh speech recognition based on the CTC model. The findings show that an RNN-T-based model can work well without additional components such as a language model, and that it gave the best result on our dataset. The system reached a character error rate (CER) of 10.6%, which is the best result among end-to-end systems for recognizing Kazakh speech.
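The reported metric, CER, is conventionally the character-level edit distance between the reference transcripts and the recognized text, divided by the total number of reference characters. The sketch below illustrates that standard definition only; the abstract does not describe the authors' scoring procedure or text normalization, so the function names `edit_distance` and `cer` and the sample strings are assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing
                curr[j - 1] + 1,     # insertion: extra hypothesis character
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total character edits / total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a 10-character reference.
    refs = ["сәлем әлем"]
    hyps = ["салем әлем"]
    print(f"CER = {cer(refs, hyps):.1%}")
```

Under this definition, a CER of 10.6% corresponds to roughly one character error per nine to ten reference characters. Whether spaces and punctuation are counted depends on the normalization applied before scoring, which is not stated here.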