Semantic Document Clustering using K-means algorithm and Ward's Method

Niyaz Salih, Karwan Jacksi
2020 International Conference on Advanced Science and Engineering (ICOASE), published 2020-12-23
DOI: 10.1109/ICOASE51841.2020.9436588 (https://doi.org/10.1109/ICOASE51841.2020.9436588)
Citations: 3

Abstract

Textual documents are growing rapidly on the internet. Offline and online documents, websites, e-mails, social network posts, and blog posts are archived in structured electronic databases. Without adequate ranking and on-demand clustering, and with classification that lacks detail, these documents are hard to maintain and retrieve. This paper presents an approach based on semantic similarity for clustering documents using the NLTK dictionary. Synopses are taken from IMDB and Wikipedia datasets, then tokenized and stemmed. Next, a vector space is constructed using TF-IDF, and clustering is performed using Ward's method and the K-means algorithm. WordNet is also used to cluster documents semantically. The results are visualized and presented as an interactive website that describes the relationships between all clusters. For each algorithm, three implementation scenarios are considered: 1) without preprocessing, 2) preprocessing without stemming, and 3) preprocessing with stemming. The Silhouette metric and seven other metrics are used to measure similarity on five different datasets. With the K-means algorithm, the best Silhouette similarity ratio across all clusters is obtained on the nltk-Reuters dataset, and the highest ratio occurs at k=10. Similarly, with Ward's method, the highest Silhouette similarity ratio across all clusters is obtained on the IMDB and Wiki top 100 movies and nltk-brown datasets together, and the best similarity ratio is obtained at k=5 on the IMDB and Wiki top 100 movies dataset. The results are compared with the literature, and the outcome shows that Ward's method outperforms K-means on small datasets.
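The pipeline described in the abstract (tokenize, build a TF-IDF vector space, cluster with K-means and with Ward's method, score with the Silhouette metric) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library rather than NLTK, omits stemming and WordNet (so it is closest to scenario 2, preprocessing without stemming), and the stop-word list, the smoothed IDF variant, the farthest-point seeding, the four toy synopses, and every function name are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Toy stop-word list (an assumption; not the list used in the paper).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf(docs):
    """L2-normalised TF-IDF vectors; IDF = ln((1 + n) / (1 + df)) + 1 (assumed variant)."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    df = Counter(t for doc in tokens for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokens:
        tf = Counter(doc)
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(X, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda x: min(sq_dist(x, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: sq_dist(x, centers[c])) for x in X]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels

def ward(X, k):
    """Naive agglomerative clustering: merge the pair with the smallest Ward cost."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a = [X[p] for p in clusters[i]]
                b = [X[p] for p in clusters[j]]
                # Increase in within-cluster sum of squares caused by the merge.
                cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(centroid(a), centroid(b))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    labels = [0] * len(X)
    for c, members in enumerate(clusters):
        for p in members:
            labels[p] = c
    return labels

def silhouette(X, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    scores = []
    for i, x in enumerate(X):
        by_cluster = {}
        for j, y in enumerate(X):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(x, y))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single-cluster labelling
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)

# Hypothetical movie synopses, invented for this sketch.
docs = [
    "A detective hunts a killer in the city.",
    "The detective and the killer meet in the city.",
    "A spaceship crew explores a distant planet.",
    "The crew of the spaceship lands on the planet.",
]
X = tfidf(docs)
km_labels = kmeans(X, 2)
ward_labels = ward(X, 2)
print("K-means:", km_labels, round(silhouette(X, km_labels), 3))
print("Ward:   ", ward_labels, round(silhouette(X, ward_labels), 3))
```

On this toy input both methods separate the two detective synopses from the two spaceship synopses. The Ward routine recomputes every pairwise merge cost at each step, which is fine for a handful of documents but far too slow for corpora of the size the abstract mentions; a library implementation would be the practical choice there.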