A Topic Modeling for Clustering Arabic Documents
Authors: Doaa Wahhab Alkhafaji, Sura Z. Al-Rashid
Published in: 2021 2nd Information Technology To Enhance e-learning and Other Application (IT-ELA), 2021-12-28
DOI: https://doi.org/10.1109/IT-ELA52201.2021.9773538
Citations: 1
Abstract
Topic modeling is a statistical data mining technique for discovering the abstract “topics” that occur in a collection of documents; the most widely used topic modeling technique is latent Dirichlet allocation (LDA). In this paper, we test the effectiveness of LDA2Vec, a recently developed topic modeling approach introduced by Chris Moody, on our Arabic dataset. LDA2Vec is a hybrid of LDA and the highly popular word-embedding model Word2Vec. Our goal is to find a method for automatically clustering Arabic documents by topic and categorizing them for use in recommendation systems and search. We evaluated the model on a corpus of Arabic documents divided into nine categories. Despite the grammatical differences between Arabic and English, the model worked well on Arabic text in our study. LDA2Vec achieved 82.40% accuracy over topics for test documents, compared with 67.96% for LDA evaluated on the same dataset.
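The abstract does not say how "accuracy over topics" was computed. One plausible reading, offered here only as an assumption, is cluster purity: each document is assigned its dominant topic, each topic is mapped to the category that occurs most often among its documents, and the score is the fraction of documents whose category matches that mapping. The sketch below illustrates this reading with a hypothetical function name and toy data; it is not the authors' evaluation code.

```python
from collections import Counter, defaultdict


def topic_accuracy(dominant_topics, true_categories):
    """Score topic assignments against known categories by majority vote.

    Assumption: this is cluster purity, which may differ from the
    measure used in the paper.
    """
    if len(dominant_topics) != len(true_categories):
        raise ValueError("inputs must have the same length")
    if not dominant_topics:
        raise ValueError("inputs must not be empty")

    # Count how often each category appears within each topic.
    counts = defaultdict(Counter)
    for topic, category in zip(dominant_topics, true_categories):
        counts[topic][category] += 1

    # Each topic is credited with the documents of its most frequent category.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(true_categories)


# Hypothetical toy data: six documents, three topics, three categories.
topics = [0, 0, 1, 1, 2, 2]
labels = ["sport", "sport", "economy", "sport", "culture", "culture"]
accuracy = topic_accuracy(topics, labels)
```

Because the topic-to-category mapping is derived from the same documents that are scored, this measure is optimistic; a stricter variant would learn the mapping on training documents and apply it to held-out test documents.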