Semantic Document Clustering using K-means algorithm and Ward's Method
Niyaz Salih, Karwan Jacksi
2020 International Conference on Advanced Science and Engineering (ICOASE), published 2020-12-23
DOI: https://doi.org/10.1109/ICOASE51841.2020.9436588
Citations: 3
Abstract
Textual documents are growing rapidly on the internet. Offline and online documents, websites, e-mails, social network posts, and blog posts are archived in structured electronic databases. These documents are hard to maintain and retrieve without adequate ranking and on-demand clustering, particularly when the available classification carries no detail. This paper presents an approach to clustering documents based on semantic similarity, using the NLTK dictionary. Synopses are taken from IMDB and Wikipedia datasets, then tokenized and stemmed. Next, a vector space is constructed using TF-IDF, and clustering is performed with Ward's method and the K-means algorithm. WordNet is also used to cluster documents semantically. The results are visualized and presented as an interactive website describing the relationships between all clusters. For each algorithm, three implementation scenarios are considered: 1) without preprocessing, 2) preprocessing without stemming, and 3) preprocessing with stemming. The Silhouette metric and seven other metrics are used to measure similarity on five different datasets. With the K-means algorithm, the best Silhouette similarity ratio for all clusters is obtained on the nltk-Reuters dataset, and the highest ratio occurs at k=10. Similarly, with Ward's algorithm, the highest Silhouette similarity ratio for all clusters is obtained on the IMDB and Wiki top 100 movies and nltk-brown datasets together, and the best similarity ratio is obtained at k=5 on the IMDB and Wiki top 100 movies dataset. The results are compared with the literature, and the outcome shows that Ward's method outperforms K-means on small datasets.
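The TF-IDF and Silhouette steps named in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the tiny stop-word list, the smoothed IDF variant, the use of cosine distance, and the four example synopses are all assumptions made for this sketch. Stemming is omitted, which roughly corresponds to the "preprocessing without stemming" scenario, and the cluster labels are supplied by hand where K-means or Ward's method would normally produce them.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real pipelines use a fuller list).
STOP_WORDS = {"a", "the", "in", "through", "to", "and"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse, L2-normalised TF-IDF dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF, log((1 + n) / (1 + df)) + 1, one common variant.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity, valid because the vectors are L2-normalised."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def silhouette_score(vectors, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)

# Hypothetical synopses, invented for this example.
docs = [
    "A detective hunts a killer in a dark city",
    "The detective tracks the killer through the city",
    "Two robots explore a distant planet in space",
    "Robots and astronauts travel through space to a planet",
]
vectors = tfidf_vectors(docs)
good = silhouette_score(vectors, [0, 0, 1, 1])  # labels follow the topics
bad = silhouette_score(vectors, [0, 1, 0, 1])   # labels cut across the topics
print(f"matching labels: {good:.3f}, mismatched labels: {bad:.3f}")
```

A labeling that follows the topics scores above zero and one that cuts across them scores below zero, which is the property that lets the Silhouette metric compare different values of k.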