Generation and Splitting of the Compound Words in Nepali Text

Prabin Acharya, S. Shakya
DOI: 10.36548/jitdw.2022.3.007 (https://doi.org/10.36548/jitdw.2022.3.007)
Published: 2022-09-19 (Journal Article)

Abstract

In the Nepali language, compound word formation is mostly associated with inflection, derivation, and postposition attachment. Inflection arises from suffixation, whereas derivation is driven by both prefixation and suffixation. Generating compounds by rule can produce many out-of-vocabulary words because lexical resources are limited and exceptions are numerous. A machine learning approach can therefore help to generate valid compounds and to split them into valid morphemes, which can then serve as a resource for spelling suggestion, information retrieval, and machine translation. This research proposes a method for generating valid compounds from their corresponding splits (head word plus prefix, suffix, or postposition). A BiLSTM-based deep learning approach was used both to generate and to split valid compound words. Publicly available Nepali Brihat Shabdakosh data from the Nepal Academy and scraped news data were used for the experiments. The results were markedly better than those of a rule-based approach applied to a similar task.
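
The abstract does not say how compound/split pairs are encoded for the BiLSTM. One common framing for this kind of splitting task, shown here only as an assumption and not as the authors' method, is character-level boundary tagging: each Unicode code point of the compound is labelled `B` if it starts a morpheme and `I` otherwise, and a tagger learns to predict those labels. The sketch below covers only the data-preparation step and its inverse. The function names and the example word घरहरूमा (घर "house" + plural हरू + locative postposition मा) are illustrative choices, and the encoding handles only splits that concatenate back to the compound unchanged.

```python
from typing import List


def boundary_labels(compound: str, parts: List[str]) -> List[str]:
    """Label each code point 'B' (starts a morpheme) or 'I' (continues one).

    Hypothetical helper: assumes the morphemes concatenate exactly to the
    compound, so splits involving sound changes are rejected.
    """
    if "".join(parts) != compound:
        raise ValueError("parts must concatenate to the compound")
    labels: List[str] = []
    for part in parts:
        if not part:
            raise ValueError("empty morpheme in split")
        labels.extend(["B"] + ["I"] * (len(part) - 1))
    return labels


def split_from_labels(compound: str, labels: List[str]) -> List[str]:
    """Rebuild the morpheme list from per-code-point labels."""
    if len(labels) != len(compound):
        raise ValueError("need exactly one label per code point")
    parts: List[str] = []
    for ch, label in zip(compound, labels):
        if label == "B" or not parts:
            parts.append(ch)      # open a new morpheme
        else:
            parts[-1] += ch       # extend the current morpheme
    return parts


compound = "घरहरूमा"
parts = ["घर", "हरू", "मा"]
labels = boundary_labels(compound, parts)
recovered = split_from_labels(compound, labels)
```

Generation (the reverse direction, from head word and affixes to a valid compound) would need a different setup, for example a sequence-to-sequence model, because a valid compound is not always a plain concatenation of its parts.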