{"title":"Amharic document clustering using semantic information from neural word embedding and encyclopedic knowledge","authors":"Dessalew Yohannes , Yenewondim Biadgie Sinshaw , Surafiel Habib Asefa , Yaregal Assabie","doi":"10.1016/j.sciaf.2025.e02657","DOIUrl":null,"url":null,"abstract":"<div><div>Amharic is the working language of Ethiopia, and its complex morphology, coupled with limited usable resources, makes the development of text processing applications for the language a challenging task. In this paper, we introduce Amharic document clustering system using an integration of the semantic information extracted from the word embedding model and encyclopedic knowledge. The encyclopedic knowledge is stored as a database with tree-like structures, enabling the construction of structured concepts, whereas the word embedding is used to capture the contextual relatedness between two concepts. Text features are extracted and further expanded with semantically similar concepts by mapping document words to the structured concepts constructed from the encyclopedic knowledge. The expanded text features are subsequently weighted using the TF-IDF method, resulting in a weighted document-by-term matrix. Finally, documents are clustered based on this matrix using the spherical <span><math><mi>k</mi></math></span>-means algorithm. The proposed system is tested using an Amharic text corpus and the Amharic Wikipedia which is utilized as encyclopedic knowledge. The implementation is carried out in low-resource setting and we use word embedding to capture the semantic information of terms. 
Various experiments are conducted, and test results show that the use of encyclopedic knowledge with semantic information shows better performance in comparison to other conventional clustering techniques, providing new insights for advancing text clustering, especially for low-resourced languages where computational and linguistic resources are limited.</div></div>","PeriodicalId":21690,"journal":{"name":"Scientific African","volume":"28 ","pages":"Article e02657"},"PeriodicalIF":2.7000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific African","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2468227625001279","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Abstract
Amharic is the working language of Ethiopia. Its complex morphology, coupled with limited usable resources, makes developing text processing applications for the language challenging. In this paper, we introduce an Amharic document clustering system that integrates semantic information extracted from a word embedding model with encyclopedic knowledge. The encyclopedic knowledge is stored as a database with tree-like structures, which enables the construction of structured concepts, whereas the word embedding captures the contextual relatedness between two concepts. Text features are extracted and then expanded with semantically similar concepts by mapping document words to the structured concepts built from the encyclopedic knowledge. The expanded text features are weighted using the TF-IDF method, yielding a weighted document-by-term matrix. Finally, documents are clustered on this matrix using the spherical k-means algorithm. The proposed system is tested on an Amharic text corpus, with the Amharic Wikipedia serving as the encyclopedic knowledge. The implementation is carried out in a low-resource setting, and we use word embedding to capture the semantic information of terms. Experimental results show that combining encyclopedic knowledge with semantic information outperforms conventional clustering techniques, providing new insights for advancing text clustering, especially for low-resourced languages where computational and linguistic resources are limited.
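The pipeline described above (feature expansion, TF-IDF weighting, spherical k-means) can be illustrated with a minimal sketch. This is not the authors' implementation. The `CONCEPTS` map is a hypothetical stand-in for concepts built from the encyclopedic knowledge, English placeholder tokens replace Amharic ones, the smoothed IDF formula is one common variant chosen as an assumption, and the embedding-based relatedness step is omitted entirely.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical word -> concept map; a real system would derive this
# from the encyclopedic knowledge base rather than hard-code it.
CONCEPTS = {
    "goal": ["football"], "referee": ["football"],
    "striker": ["football"], "match": ["football"],
    "vote": ["election"], "minister": ["election"],
    "ballot": ["election"], "parliament": ["election"],
}


def expand(tokens, concepts):
    """Return the tokens plus the concepts that each token maps to."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(concepts.get(tok, []))
    return expanded


def tfidf_matrix(docs):
    """Build a document-by-term matrix weighted by tf * smoothed idf."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: j for j, t in enumerate(vocab)}
    df = Counter(t for d in docs for t in set(d))  # document frequency
    n = len(docs)
    X = np.zeros((n, len(vocab)))
    for i, d in enumerate(docs):
        for t, count in Counter(d).items():
            # Assumed idf variant: ln((1 + n) / (1 + df)) + 1
            X[i, index[t]] = count * (math.log((1 + n) / (1 + df[t])) + 1.0)
    return X, vocab


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster non-empty rows of X by cosine similarity to unit centroids."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project rows to unit sphere
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)  # nearest centroid by cosine
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                total = members.sum(axis=0)
                centroids[c] = total / np.linalg.norm(total)
    return np.argmax(X @ centroids.T, axis=1)


docs = [
    ["goal", "referee"],
    ["striker", "match"],
    ["vote", "minister"],
    ["ballot", "parliament"],
]
expanded = [expand(d, CONCEPTS) for d in docs]
X, vocab = tfidf_matrix(expanded)
labels = spherical_kmeans(X, k=2)
print(labels)
```

Before expansion, the first two toy documents share no terms, so their cosine similarity is zero. After expansion both contain the shared concept `football`, which is what lets the similarity-based clustering group them.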