OOV Handling Using Partial Lemma-Based Language Model in LF-MMI Based ASR for Bahasa Indonesia

Agung Santosa, Asril Jarin, E. M. Yuniarno, Hammam Riza, M. Purnomo
Published in: 2022 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM)
Publication date: 2022-11-22
DOI: 10.1109/CENIM56801.2022.10037479 (https://doi.org/10.1109/CENIM56801.2022.10037479)
Citation count: 1

Abstract

Out-of-vocabulary (OOV) words in an utterance are a common problem in automatic speech recognition (ASR) and can degrade system performance. Bahasa Indonesia, an agglutinative language, forms words by affixation, combining a set of affixes with root words. We propose a partial lemma-based language model (LM) and lexicon that can handle words created by affixation. Both are derived from the original LM and lexicon, using the output of a morphology analyzer as a reference. Experiments show that using this LM in ASR trained with the lattice-free maximum mutual information (LF-MMI) cost function yields a better word error rate (WER) when the heuristic for inserting inter-word short pauses is modified to also consider the affixes.
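
The abstract does not spell out how the partial lemma-based LM text is produced, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that a morphology analyzer supplies a (prefixes, lemma, suffixes) analysis per word, that only words rarer than a threshold are decomposed (one guess at what "partial" means), and that affix tokens are marked with a `+` sign. The names `ANALYSES`, `split_word`, `to_partial_lemma`, `join_tokens`, and `min_count` are all hypothetical.

```python
from collections import Counter

# Hypothetical morphology-analyzer output: word -> (prefixes, lemma, suffixes).
# A real analyzer would produce this; these entries are illustrative only.
ANALYSES = {
    "membaca": (["mem"], "baca", []),
    "makanan": ([], "makan", ["an"]),
    "pembacaan": (["pem"], "baca", ["an"]),
}


def split_word(word, analyses):
    """Return affix-marked tokens for a word, or [word] if it has no analysis."""
    if word not in analyses:
        return [word]
    prefixes, lemma, suffixes = analyses[word]
    # Assumed convention: trailing '+' marks a prefix, leading '+' a suffix.
    return [p + "+" for p in prefixes] + [lemma] + ["+" + s for s in suffixes]


def to_partial_lemma(sentences, analyses, min_count=2):
    """Rewrite LM training text, decomposing only words seen < min_count times."""
    counts = Counter(w for s in sentences for w in s.split())
    rewritten = []
    for s in sentences:
        tokens = []
        for w in s.split():
            if counts[w] >= min_count:
                tokens.append(w)  # frequent word: keep the full form
            else:
                tokens.extend(split_word(w, analyses))
        rewritten.append(" ".join(tokens))
    return rewritten


def join_tokens(tokens):
    """Reattach affix tokens to rebuild surface words from decoder output."""
    words, pending = [], ""
    for t in tokens:
        if t.endswith("+"):  # prefix: hold it until the lemma arrives
            pending += t[:-1]
        elif t.startswith("+") and words and not pending:  # suffix
            words[-1] += t[1:]
        else:
            words.append(pending + t.lstrip("+"))
            pending = ""
    if pending:  # dangling prefix with no lemma after it
        words.append(pending)
    return words


if __name__ == "__main__":
    corpus = ["saya membaca buku", "saya membaca", "pembacaan buku"]
    for line in to_partial_lemma(corpus, ANALYSES):
        print(line)
```

Rejoining by plain concatenation works here only because the example affixes are stored in their surface form; Indonesian affixation can also change the first sound of the root, which a real system would have to handle. The same affix markers could in principle tell the lexicon builder where an inter-word short pause should not be inserted, which is the kind of heuristic change the abstract refers to.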