A Topic Modeling for Clustering Arabic Documents

Doaa Wahhab Alkhafaji, Sura Z. Al-Rashid
{"title":"A Topic Modeling for Clustering Arabic Documents","authors":"Doaa Wahhab Alkhafaji, Sura Z. Al-Rashid","doi":"10.1109/IT-ELA52201.2021.9773538","DOIUrl":null,"url":null,"abstract":"Topic modeling is a type of statistical data mining technique for discovering the abstract “topics” that occur in a collection of articles or documents and the most widely used topic modeling technique is LDA. In our paper, we tested the effectiveness of a recently developed topic modeling approach (LDA2Vec) that was introduced by Chris Moody with our Arabic dataset. LDA2Vec is a hybrid approach of LDA and a highly popular word-embedding model (Word2Vec). Our goal is to find a method for automatically clustering Arabic documents by topic and categorizing them for use in a recommendation system and searching. The performance of the model was evaluated using a corpus of Arabic documents divided into nine categories. Despite the grammatical variations between Arabic and English, the model worked well with the Arabic language when it was implemented, as we observed in our study. 
As a conclusion of our findings, LDA2Vec gave (82.40%) accuracy over topics for test documents, which is greater than LDA accuracy (67.96%), which was evaluated with the same dataset.","PeriodicalId":330552,"journal":{"name":"2021 2nd Information Technology To Enhance e-learning and Other Application (IT-ELA)","volume":"156 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 2nd Information Technology To Enhance e-learning and Other Application (IT-ELA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IT-ELA52201.2021.9773538","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 1

Abstract

Topic modeling is a statistical data mining technique for discovering the abstract “topics” that occur in a collection of articles or documents; the most widely used topic modeling technique is Latent Dirichlet Allocation (LDA). In this paper, we test the effectiveness of LDA2Vec, a recently developed topic modeling approach introduced by Chris Moody, on our Arabic dataset. LDA2Vec is a hybrid of LDA and Word2Vec, a highly popular word-embedding model. Our goal is to find a method for automatically clustering Arabic documents by topic and categorizing them for use in recommendation systems and search. The model was evaluated on a corpus of Arabic documents divided into nine categories. Despite the grammatical differences between Arabic and English, the model worked well on Arabic text in our study. LDA2Vec achieved 82.40% accuracy over topics on the test documents, compared with 67.96% for LDA evaluated on the same dataset.
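To make the "hybrid of LDA and Word2Vec" idea concrete, the following is a minimal sketch of how LDA2Vec is generally described as combining the two: each document holds unnormalized topic weights, a softmax turns them into LDA-style topic proportions, the document vector is the proportion-weighted sum of topic vectors, and that vector is added to a Word2Vec-style word vector to form the context vector. This is not the authors' implementation; every function name and all toy values below are assumptions chosen for illustration, and no training step is shown.

```python
import math


def softmax(weights):
    """Turn unnormalized document-topic weights into proportions that sum to 1."""
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def document_vector(doc_weights, topic_vectors):
    """Mix topic vectors by the document's topic proportions (the LDA-like part)."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [
        sum(p * topic[i] for p, topic in zip(proportions, topic_vectors))
        for i in range(dim)
    ]


def context_vector(word_vector, doc_vector):
    """Add the word vector (the Word2Vec-like part) to the document vector."""
    return [w + d for w, d in zip(word_vector, doc_vector)]


def dominant_topic(doc_weights):
    """Index of the topic with the largest proportion, usable as a cluster label."""
    proportions = softmax(doc_weights)
    return max(range(len(proportions)), key=proportions.__getitem__)


# Toy values (hypothetical): two topics embedded in a 2-dimensional space.
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_weights = [0.0, 0.0]   # equal weights, so equal topic proportions
word_vector = [1.0, 2.0]

doc_vec = document_vector(doc_weights, topic_vectors)
ctx_vec = context_vector(word_vector, doc_vec)
```

In a clustering setting such as the one described here, `dominant_topic` would give each document a single topic label that can then be compared with its known category; how the paper maps topics to its nine categories is not stated in the abstract.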