Unicode Sinhala and phonetic English bi-directional conversion for Sinhala speech recognizer

M. Punchimudiyanse, R. Meegama
2015 IEEE 10th International Conference on Industrial and Information Systems (ICIIS), published 2015-12-01
DOI: 10.1109/ICIINFS.2015.7399027 (https://doi.org/10.1109/ICIINFS.2015.7399027)
Citations: 5

Abstract

A large-vocabulary automated speech recognizer (ASR) has yet to be developed for the Sinhala language because gathering the training data needed to build a language model is time-consuming. Building the dictionary and the language model requires non-English text, in our case Sinhala Unicode, to be transcribed into phonetic English text. Unlike text-to-speech conversion, which only requires transcribing the non-English text into phonetic English, an ASR must also reproduce the original-language text correctly from the phonetic English text that the speech recognizer outputs. In the present research, newspaper articles are used to gather a large set of sentences for building a language model of thousands of words for the Sphinx ASR. We present a decoder algorithm that produces phonetic English text from Sinhala Unicode text and an encoder algorithm that reproduces the Unicode Sinhala text from phonetic English. For a near-phonetic tag set for the Sinhala alphabet, results indicate 100% accuracy for the decoder algorithm, while for text containing no numbers the encoder algorithm achieves 98.61% accuracy on distinct phonetic English words.
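
To make the two directions concrete, the following is a minimal Python sketch of a decoder (Sinhala Unicode to phonetic tags) and an encoder (phonetic tags back to Sinhala Unicode). It is not the authors' algorithm or tag set, which the abstract does not specify. The tag names, the tiny letter tables, and the function names `decode` and `encode` are all assumptions chosen for illustration. The sketch covers only a handful of letters and relies on the inherent vowel "a" that a bare Sinhala consonant carries unless a vowel sign or the al-lakuna (vowel-killer) mark follows it.

```python
# Hypothetical, deliberately tiny tag tables (NOT the tag set used in the paper).
CONSONANTS = {
    "\u0d9a": "k",  # ක
    "\u0db8": "m",  # ම
    "\u0dbd": "l",  # ල
    "\u0db1": "n",  # න
}
VOWEL_SIGNS = {
    "\u0dcf": "aa",  # ා
    "\u0dd2": "i",   # ි
    "\u0dd4": "u",   # ු
}
INDEPENDENT_VOWELS = {
    "\u0d85": "a",   # අ
    "\u0d86": "aa",  # ආ
    "\u0d89": "i",   # ඉ
    "\u0d8b": "u",   # උ
}
AL_LAKUNA = "\u0dca"  # ් removes the inherent vowel of a consonant
INHERENT = "a"

# Reverse tables used by the encoder.
TAG_TO_CONSONANT = {tag: ch for ch, tag in CONSONANTS.items()}
TAG_TO_SIGN = {tag: ch for ch, tag in VOWEL_SIGNS.items()}
TAG_TO_VOWEL = {tag: ch for ch, tag in INDEPENDENT_VOWELS.items()}


def decode(word):
    """Sinhala Unicode word -> list of phonetic tags."""
    tags = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            tags.append(CONSONANTS[ch])
            if nxt == AL_LAKUNA:        # pure consonant, no vowel
                i += 2
            elif nxt in VOWEL_SIGNS:    # consonant + explicit vowel sign
                tags.append(VOWEL_SIGNS[nxt])
                i += 2
            else:                       # bare consonant: inherent vowel
                tags.append(INHERENT)
                i += 1
        elif ch in INDEPENDENT_VOWELS:
            tags.append(INDEPENDENT_VOWELS[ch])
            i += 1
        else:
            raise ValueError(f"character not in sketch tables: {ch!r}")
    return tags


def encode(tags):
    """List of phonetic tags -> Sinhala Unicode word."""
    out = []
    i = 0
    while i < len(tags):
        tag = tags[i]
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        if tag in TAG_TO_CONSONANT:
            out.append(TAG_TO_CONSONANT[tag])
            if nxt == INHERENT:         # inherent vowel needs no sign
                i += 2
            elif nxt in TAG_TO_SIGN:    # attach the matching vowel sign
                out.append(TAG_TO_SIGN[nxt])
                i += 2
            else:                       # no vowel follows: add al-lakuna
                out.append(AL_LAKUNA)
                i += 1
        elif tag in TAG_TO_VOWEL:
            out.append(TAG_TO_VOWEL[tag])
            i += 1
        else:
            raise ValueError(f"tag not in sketch tables: {tag!r}")
    return "".join(out)
```

With these tables, `decode("\u0d85\u0db8\u0dca\u0db8\u0dcf")` (අම්මා) gives `['a', 'm', 'm', 'aa']`, and `encode` applied to that list rebuilds the original word. The encoder has to infer orthographic detail (vowel sign, independent vowel, or al-lakuna) that the phonetic tags do not state directly, so in a toy mapping like this one an exact round trip is only guaranteed for spellings the tables handle unambiguously.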