Amharic document clustering using semantic information from neural word embedding and encyclopedic knowledge

Impact factor: 2.7 (JCR Q2, Multidisciplinary Sciences)

Scientific African. Published: 2025-04-03. DOI: 10.1016/j.sciaf.2025.e02657

Dessalew Yohannes, Yenewondim Biadgie Sinshaw, Surafiel Habib Asefa, Yaregal Assabie

Scientific African, volume 28, Article e02657. Full text: https://www.sciencedirect.com/science/article/pii/S2468227625001279


Abstract

Amharic is the working language of Ethiopia, and its complex morphology, coupled with limited usable resources, makes developing text processing applications for the language a challenging task. In this paper, we introduce an Amharic document clustering system that integrates semantic information extracted from a word embedding model with encyclopedic knowledge. The encyclopedic knowledge is stored as a database with tree-like structures, enabling the construction of structured concepts, whereas the word embedding is used to capture the contextual relatedness between two concepts. Text features are extracted and then expanded with semantically similar concepts by mapping document words to the structured concepts constructed from the encyclopedic knowledge. The expanded text features are subsequently weighted using the TF-IDF method, resulting in a weighted document-by-term matrix. Finally, documents are clustered based on this matrix using the spherical k-means algorithm. The proposed system is tested on an Amharic text corpus, with the Amharic Wikipedia serving as the encyclopedic knowledge. The implementation is carried out in a low-resource setting, and we use word embedding to capture the semantic information of terms. Various experiments are conducted, and the test results show that combining encyclopedic knowledge with semantic information outperforms conventional clustering techniques, providing new insights for advancing text clustering, especially for low-resourced languages where computational and linguistic resources are limited.
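The pipeline described in the abstract (feature expansion, TF-IDF weighting, spherical k-means) can be illustrated with a minimal sketch. This is not the authors' implementation. The `CONCEPTS` dictionary is a hypothetical stand-in for the Wikipedia-derived structured concepts, English tokens stand in for Amharic words, the smoothed IDF formula is an assumption, and the word embedding step that scores relatedness between concepts is omitted entirely. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical word -> concept map; a stand-in for structured concepts
# built from encyclopedic knowledge (not the authors' actual resource).
CONCEPTS = {
    "coffee": ["beverage"],
    "tea": ["beverage"],
    "football": ["sport"],
    "running": ["sport"],
}

def expand(tokens, concepts):
    """Return the tokens followed by every concept mapped from them."""
    out = list(tokens)
    for t in tokens:
        out.extend(concepts.get(t, []))
    return out

def normalize(v):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tfidf_matrix(docs):
    """Build a document-by-term matrix of unit-length TF-IDF rows."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    rows = []
    for d in docs:
        tf = Counter(d)
        # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in vocab]
        rows.append(normalize(vec))
    return vocab, rows

def spherical_kmeans(rows, init, iters=20):
    """Cluster unit vectors by cosine similarity.

    `init` lists the row indices used as initial centroids, so the
    number of clusters is len(init). Centroids are re-normalised to
    unit length after every update, which is what makes it spherical.
    """
    centroids = [list(rows[i]) for i in init]
    k = len(centroids)
    labels = [-1] * len(rows)
    for _ in range(iters):
        new_labels = [max(range(k), key=lambda c: dot(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels

docs = [["coffee", "morning"], ["tea", "evening"],
        ["football", "match"], ["running", "race"]]
expanded = [expand(d, CONCEPTS) for d in docs]
vocab, rows = tfidf_matrix(expanded)
labels = spherical_kmeans(rows, init=[0, 2])
```

In this toy input the four documents share no surface words, so any similarity between them comes only from the appended concepts. Initial centroids are fixed by index to keep the run deterministic; a practical system would use random or k-means++ style seeding.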


