Early Stages of Automatic Speech Recognition (ASR) in Non-English Speaking Countries and Factors That Affect the Recognition Process

American Journal of Neural Networks and Applications Pub Date : 1900-01-01 DOI:10.11648/j.ajnna.20210701.13

Dunya Yousufzai



Abstract

ASR has advanced considerably over the past few decades, so it may seem strange that the field is still an active subject of research. There are many reasons, but one is that the discipline was founded on the promise of human-level performance under practical conditions, and this remains an intractable problem. In addition, technological progress in many fields has made the need for ASR more compelling. Establishing such a system is especially urgent for the security sector in insecure third-world countries such as Afghanistan. This paper first reviews the necessary background on speech recognition and then proposes an unprecedented method for building an automatic speech recognition (ASR) system for the Dari language using the two most powerful open-source engines: CMUSphinx, from Carnegie Mellon University, and DeepSpeech v0.9.3. These systems are far more capable than early speech recognition systems. A speech-to-text model for Dari was trained on a dataset I collected myself. First, the dataset is filtered according to the task; the paper then demonstrates how the approach carries over from hidden Markov models (HMMs) and the phoneme concept to RNN training. The system surpassed previously predicted results: as CMUSphinx states, "for a typical 10-hour operation, the WER should be around 10%." In the end, a WER of 3.3% was achieved with 10.3 hours of recorded audio using CMUSphinx, and a WER of 1% with DeepSpeech.
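The results above are reported as word error rate (WER). As a minimal sketch, assuming the standard definition (substitutions, deletions, and insertions divided by the number of reference words), WER can be computed with a word-level edit distance. The abstract does not describe the scoring script that was actually used, so the function name `word_error_rate` and the whitespace tokenization are assumptions made only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption and not the paper's procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy strings, not data from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions, which is worth keeping in mind when comparing figures across systems.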
