Semantic Document Clustering using K-means algorithm and Ward's Method

2020 International Conference on Advanced Science and Engineering (ICOASE). Pub Date: 2020-12-23. DOI: 10.1109/ICOASE51841.2020.9436588

Niyaz Salih, Karwan Jacksi


Citations: 3

Abstract

The volume of textual documents on the internet is growing rapidly. Offline and online documents, websites, e-mails, social network posts, and blog posts are archived in structured electronic databases. Without adequate ranking and on-demand clustering, and with classification that carries little detail, these documents are hard to maintain and retrieve. This paper presents an approach based on semantic similarity for clustering documents using the NLTK dictionary. The procedure starts from synopses drawn from IMDB and Wikipedia datasets, which are tokenized and stemmed. Next, a vector space is constructed using TF-IDF, and clustering is performed with Ward's method and the K-means algorithm. WordNet is also used to cluster documents semantically. The results are visualized and presented as an interactive website describing the relationships among all clusters. For each algorithm, three implementation scenarios are considered: 1) without preprocessing, 2) preprocessing without stemming, and 3) preprocessing with stemming. The Silhouette metric and seven other metrics are used to measure similarity on five different datasets. With the K-means algorithm, the best similarity ratio under the Silhouette metric is obtained with the nltk-Reuters dataset for all clusters, and the highest ratio occurs at k=10. Similarly, with Ward's method, the highest Silhouette similarity ratio for all clusters is obtained with the IMDB and Wiki top 100 movies and nltk-brown datasets together, and the best ratio is obtained at k=5 with the IMDB and Wiki top 100 movies dataset. The results are compared with the literature and show that Ward's method outperforms K-means on small datasets.
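The pipeline named in the abstract (TF-IDF vector space, K-means, Ward's method, Silhouette score) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, skips stemming and the WordNet step, tokenizes by lowercasing and splitting on whitespace, and runs on six made-up toy synopses. The function names, the smoothed IDF formula, and the farthest-first seeding for K-means are all assumptions chosen to keep the example short and deterministic.

```python
import math
from collections import Counter

def tfidf(docs):
    """L2-normalised TF-IDF vectors for lowercased, whitespace-split documents."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokens for t in toks})
    df = Counter(t for toks in tokens for t in set(toks))  # document frequency
    n = len(docs)
    vectors = []
    for toks in tokens:
        tf = Counter(toks)
        # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(X, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-first seeding (an assumption)."""
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda x: min(sqdist(x, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(x, centers[j])) for x in X]
        updated = []
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            updated.append(centroid(members) if members else centers[j])
        if updated == centers:  # converged
            break
        centers = updated
    return labels

def ward(X, k):
    """Naive agglomerative clustering: repeatedly merge the pair of clusters
    with the smallest Ward cost |A||B| / (|A| + |B|) * ||c_A - c_B||^2."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                A, B = clusters[a], clusters[b]
                gap = sqdist(centroid([X[i] for i in A]), centroid([X[i] for i in B]))
                cost = len(A) * len(B) / (len(A) + len(B)) * gap
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected
    labels = [0] * len(X)
    for j, members in enumerate(clusters):
        for i in members:
            labels[i] = j
    return labels

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    scores = []
    for i, x in enumerate(X):
        by_cluster = {}
        for j, y in enumerate(X):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.sqrt(sqdist(x, y)))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:  # singleton cluster or only one cluster
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy synopses (invented for illustration, not from the paper's datasets).
docs = [
    "space crew ship alien planet",
    "alien planet space battle ship",
    "space ship crew rescue planet",
    "detective murder city crime police",
    "police crime detective city chase",
    "murder crime city detective trial",
]
X = tfidf(docs)
km_labels = kmeans(X, 2)
ward_labels = ward(X, 2)
print("K-means:", km_labels, round(silhouette(X, km_labels), 3))
print("Ward:   ", ward_labels, round(silhouette(X, ward_labels), 3))
```

On real corpora a library implementation would be the practical choice; the naive Ward loop here recomputes centroids for every candidate pair and is only suitable for very small inputs.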
