Generation and Splitting of the Compound Words in Nepali Text

Multiscale multimodal medical imaging: Third International Workshop, MMMI 2022, held in conjunction with MICCAI 2022, Singapore, September 22, 2022, proceedings. Pub Date: 2022-09-19. DOI: 10.36548/jitdw.2022.3.007

Prabin Acharya, S. Shakya



Abstract

In the Nepali language, compound word formation is mostly associated with inflection, derivation, and postposition attachment. Inflection arises from suffixation, whereas derivation is driven by both prefixation and suffixation. Generating compound words by rules alone can produce many out-of-vocabulary words, because lexical resources are limited and exceptions are numerous. A machine learning approach can therefore help to generate valid compounds and to split them into valid morphemes, which can in turn serve as a resource for spelling suggestion, information retrieval, and machine translation. This research proposes a method for generating valid compounds from their corresponding compound splits (a head word plus prefixes, suffixes, or postpositions). A BiLSTM-based deep learning approach was used both to generate and to split valid compound words. Publicly available Nepali Brihat Shabdakosh data from the Nepal Academy and scraped news data were used for the experiments. The results obtained were markedly better than those of a rule-based approach applied to a similar task.
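The abstract does not say how compound splits are presented to the BiLSTM. One common framing for sequence models, shown here only as an illustrative assumption and not as the authors' method, is to label each character of a compound with a boundary tag, so that splitting becomes a per-character tagging problem. The function names `encode_boundaries` and `decode_boundaries` and the `B`/`I` tag set are hypothetical. The sketch also assumes that the compound is the plain concatenation of its morphemes, which does not hold when attachment changes the surface form.

```python
from typing import List


def encode_boundaries(morphemes: List[str]) -> List[str]:
    """Return one tag per Unicode code point of the joined compound.

    'B' marks the first code point of a morpheme, 'I' marks the rest.
    Assumption: the compound is the plain concatenation of its morphemes.
    """
    tags: List[str] = []
    for morpheme in morphemes:
        if not morpheme:
            raise ValueError("morphemes must be non-empty strings")
        tags.append("B")
        tags.extend(["I"] * (len(morpheme) - 1))
    return tags


def decode_boundaries(word: str, tags: List[str]) -> List[str]:
    """Split a compound back into morphemes using per-code-point tags."""
    if len(word) != len(tags):
        raise ValueError("word and tags must have the same length")
    pieces: List[str] = []
    for char, tag in zip(word, tags):
        if tag == "B" or not pieces:
            pieces.append(char)   # start a new morpheme
        else:
            pieces[-1] += char    # continue the current morpheme
    return pieces


# Illustrative pair: head word "घर" (house) plus postposition "मा" (in).
example_split = ["घर", "मा"]
example_word = "".join(example_split)
example_tags = encode_boundaries(example_split)
print(example_word, example_tags, decode_boundaries(example_word, example_tags))
```

Under this framing, a tagger trained on (compound, tag sequence) pairs would predict the tags, and `decode_boundaries` would turn the predictions back into a head word and its affixes or postpositions. Tags are assigned per code point, so Devanagari vowel signs are tagged separately from the consonants they attach to.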
