Akbayan Bekarystankyzy, Abdul Razaque, Orken Mamyrbayev
Title: Integrated end-to-end multilingual method for low-resource agglutinative languages using Cyrillic scripts
Journal: Journal of Industrial Information Integration
Published: 2024-12-06
DOI: 10.1016/j.jii.2024.100750 (https://doi.org/10.1016/j.jii.2024.100750)
Abstract
Millions of people around the world use automatic speech recognition (ASR) systems every day to dictate messages, operate devices, start searches, and enter data on small devices. User engagement in these settings depends on the accuracy of the voice transcription and on the system's response. A second barrier to natural engagement for multilingual users is the monolingual nature of many ASR systems, which limits users to a single predefined language. Training a trustworthy and accurate ASR model requires a substantial amount of transcribed audio data. Such data is lacking for a large number of languages, particularly agglutinative languages. Much research has explored strategies for improving models for low-resource languages. This study presents an integrated end-to-end multi-language ASR (EMASR) architecture that allows users to choose from a variety of spoken-language combinations. The proposed EMASR supports low-resource agglutinative languages by fusing the features of a multi-identifier module, a voice fusion module, and a recurrent neural network module. It identifies Turkic agglutinative languages (Kazakh, Bashkir, Kyrgyz, Saha, and Tatar) to enable multilingual training using Connectionist Temporal Classification (CTC) and an attention mechanism that includes a language model (LM). These languages share cognate words, sentence-construction principles, and an alphabet (Cyrillic). We use recent advances in language identification to obtain the system's recognition accuracy and latency characteristics. Experimental results show that multilingual training outperforms monolingual training in all languages tested. Kazakh showed a particularly strong improvement: word error rate (WER) was halved and character error rate (CER) was reduced to one-third, which suggests that this strategy may benefit critically low-resource languages.
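The abstract reports its results as word error rate (WER) and character error rate (CER). As a minimal sketch of how these two metrics are commonly computed (not the authors' evaluation code), the example below divides the Levenshtein edit distance between reference and hypothesis by the reference length. The function names, the whitespace tokenization, the choice to exclude spaces from CER, and the sample Kazakh sentences are all assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are excluded here (an assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair, for illustration only.
    reference = "мен мектепке барамын"
    hypothesis = "мен мектепке бардым"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because agglutinative languages build long words from many suffixes, a single wrong suffix counts as a full word error in WER but only as a few character errors in CER, which is one reason both metrics are usually reported together.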
About the journal:
The Journal of Industrial Information Integration focuses on the industry's transition towards industrial integration and informatization, covering not only hardware and software but also information integration. It serves as a platform for promoting advances in industrial information integration, addressing challenges, issues, and solutions in an interdisciplinary forum for researchers, practitioners, and policy makers.
The Journal of Industrial Information Integration welcomes papers on foundational, technical, and practical aspects of industrial information integration, emphasizing the complex and cross-disciplinary topics that arise in industrial integration. Techniques from mathematical science, computer science, computer engineering, electrical and electronic engineering, manufacturing engineering, and engineering management are crucial in this context.