Contextual word disambiguates of Ge'ez language with homophonic using machine learning

Q1 Arts and Humanities
Mequanent Degu Belete , Ayodeji Olalekan Salau , Girma Kassa Alitasb , Tigist Bezabh
Ampersand, volume 12, Article 100169. Published 2024-03-04. DOI: 10.1016/j.amper.2024.100169
Full text: https://www.sciencedirect.com/science/article/pii/S2215039024000079
Citations: 0

Abstract

According to natural language processing experts, languages contain numerous ambiguous words. Without automated word sense disambiguation, developing natural language processing technologies such as information extraction, information retrieval, and machine translation remains a challenging task for any language. Therefore, this paper presents the development of a word sense disambiguation model for duplicate-alphabet (homophonic) words in the Ge'ez language using corpus-based methods. Because there is no WordNet or public dataset for the Ge'ez language, 1010 samples of ambiguous words were gathered. The words were then preprocessed and the text was vectorized using bag of words, Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings such as word2vec and fastText. The vectorized texts were then analysed using supervised machine learning algorithms such as Naive Bayes, decision trees, random forests, K-nearest neighbor, linear support vector machine, and logistic regression. Bag of words paired with random forests outperformed all other combinations, with an accuracy of 99.52%. However, when deep learning algorithms such as a deep neural network and Long Short-Term Memory (LSTM) were applied to the same dataset, an accuracy of 100% was achieved.
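
The pipeline described in the abstract (vectorize the context of an ambiguous word, then classify its sense) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses bag-of-words counts with a hand-written multinomial Naive Bayes classifier (one of the algorithms listed above, chosen here only because it needs no external library), and the training sentences are invented English examples standing in for the Ge'ez corpus, which is not reproduced on this page. The names `bag_of_words` and `NaiveBayesWSD` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(context):
    """Map a context string to token counts (simple whitespace tokenization)."""
    return Counter(context.lower().split())


class NaiveBayesWSD:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def fit(self, contexts, senses):
        # Count how often each sense occurs and which tokens appear with it.
        self.sense_counts = Counter(senses)
        self.token_counts = defaultdict(Counter)
        for context, sense in zip(contexts, senses):
            self.token_counts[sense].update(bag_of_words(context))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, context):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, -math.inf
        for sense, n in self.sense_counts.items():
            counts = self.token_counts[sense]
            # Add-one (Laplace) smoothing over the training vocabulary.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for token, freq in bag_of_words(context).items():
                if token in self.vocab:  # ignore tokens never seen in training
                    score += freq * math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Invented toy data: an English ambiguous word in place of Ge'ez samples.
train_contexts = [
    "deposit money at the bank",
    "the bank approved the loan",
    "fishing on the river bank",
    "the river bank was muddy",
]
train_senses = ["finance", "finance", "river", "river"]

model = NaiveBayesWSD().fit(train_contexts, train_senses)
prediction = model.predict("she took a loan from the bank")
print(prediction)
```

Swapping the vectorizer (for example TF-IDF or averaged word embeddings) or the classifier (for example random forests) changes only the two components above; the overall structure of vectorize-then-classify stays the same.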

Source journal: Ampersand (Arts and Humanities: Language and Linguistics)
CiteScore: 1.60
Self-citation rate: 0.00%
Articles published: 9
Review time: 24 weeks