Urdu News Clustering Using K-Mean Algorithm On The Basis Of Jaccard Coefficient And Dice Coefficient Similarity

Impact factor 1.7, JCR Q3 (Computer Science, Artificial Intelligence)
Zahida Rahman, Altaf Hussain, Hussain Shah, M. Arshad
Journal: ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal
Publication date: 2022-02-08
Publication type: Journal Article
DOI: https://doi.org/10.14201/adcaij2021104381399

Abstract

Clustering is an unsupervised machine learning process that groups data objects into clusters such that objects within the same cluster are highly similar to one another. The quantity of Urdu text on the internet is growing rapidly every day. Grouping Urdu news manually is almost impossible, so there is a pressing need to devise a mechanism that clusters Urdu news documents based on their similarity. Clustering Urdu news documents accurately is an open research problem, and it can be addressed by combining similarity measures, namely the Jaccard and Dice coefficients, with the k-means clustering algorithm. In this research, the Jaccard and Dice coefficients were used to compute similarity scores for Urdu news documents in the Python programming language. For clustering, the similarity results were loaded into the Waikato Environment for Knowledge Analysis (WEKA), where the k-means algorithm grouped the Urdu news documents into five clusters. The resulting clusters were evaluated in terms of accuracy and mean square error (MSE). Jaccard achieved an accuracy of 85% and an MSE of 44.4%, while the Dice coefficient achieved an accuracy of 87% and an MSE of 35.76%. The experimental results show that the Dice coefficient outperforms Jaccard similarity on both accuracy and MSE.
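To make the two similarity measures concrete, here is a minimal Python sketch of the Jaccard and Dice coefficients computed over sets of word tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how documents were tokenized or preprocessed, so the whitespace tokenizer, the function names, and the three toy headlines are all hypothetical.

```python
def tokenize(text):
    """Assumed preprocessing: split on whitespace, keep distinct tokens."""
    return set(text.split())


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B|; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient 2|A & B| / (|A| + |B|); 0.0 if both sets are empty."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def similarity_matrix(docs, measure):
    """Pairwise similarity of every document against every other document."""
    token_sets = [tokenize(d) for d in docs]
    return [[measure(x, y) for y in token_sets] for x in token_sets]


# Hypothetical toy headlines, not taken from the paper's dataset.
docs = [
    "پاکستان نے میچ جیت لیا",
    "پاکستان نے سیریز جیت لی",
    "حکومت نے بجٹ پیش کیا",
]

jaccard_scores = similarity_matrix(docs, jaccard)
dice_scores = similarity_matrix(docs, dice)

for name, scores in (("Jaccard", jaccard_scores), ("Dice", dice_scores)):
    print(name)
    for row in scores:
        print("  ", [round(value, 3) for value in row])
```

Each row of such a matrix could then serve as a feature vector for a k-means implementation such as the one in WEKA, though the exact format the authors exported is not stated in the abstract. Note that the two measures are related by D = 2J / (1 + J), so they rank document pairs identically; any difference in clustering quality comes from the numeric values passed to k-means, not from the ordering of pairs.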