Shapol M. Mohammed, Karwan Jacksi, Subhi R. M. Zeebaree
Glove Word Embedding and DBSCAN algorithms for Semantic Document Clustering
2020 International Conference on Advanced Science and Engineering (ICOASE), published 2020-12-23
DOI: 10.1109/ICOASE51841.2020.9436540
Citations: 14
Abstract
In recent document clustering work, word embeddings play the primary role in capturing semantics by measuring how often a word appears in a given context. Word2vec and GloVe are the two most widely used word embeddings in document clustering. Previous work has not considered combining GloVe word embeddings with the DBSCAN clustering algorithm for document clustering. In this work, the Wikipedia and IMDB datasets are preprocessed both with and without stemming and passed to the GloVe word embedding algorithm; the resulting word vectors are then fed to the DBSCAN clustering algorithm. Seven metrics are used to evaluate the experiments: average Silhouette, purity, accuracy, F1, completeness, homogeneity, and NMI score. The experimental results are compared with those of the TF-IDF and K-means algorithms on six datasets. The proposed approach outperforms the TF-IDF and K-means approach on the four main evaluation metrics and in CPU time.
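The pipeline described in the abstract (word vectors, then density-based clustering, then evaluation) can be illustrated with a minimal sketch. It is not the authors' implementation. Several things here are assumptions: `TOY_VECTORS` is a hypothetical 3-dimensional stand-in for pretrained GloVe vectors, documents are represented by averaging their word vectors (the abstract does not say how document vectors are built), cosine distance is used, and the `eps` and `min_pts` values are chosen only to suit the toy data. DBSCAN is written out in plain Python so the example has no dependencies.

```python
import math
from collections import Counter

# Hypothetical toy vectors standing in for pretrained GloVe embeddings.
TOY_VECTORS = {
    "film": (0.9, 0.1, 0.0), "movie": (0.8, 0.2, 0.0), "actor": (0.7, 0.3, 0.1),
    "planet": (0.0, 0.1, 0.9), "star": (0.1, 0.2, 0.8), "orbit": (0.0, 0.0, 1.0),
}

def doc_vector(tokens, vectors):
    """Average the vectors of known tokens (an assumed document representation)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has an embedding")
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbours(i):
        # A point counts as its own neighbour, as in the usual definition.
        return [j for j, p in enumerate(points)
                if cosine_distance(points[i], p) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels

def purity(labels, truth):
    """Share of points in the majority class of their cluster.

    Noise (-1) is treated as one more cluster here, which is a simplification.
    """
    by_cluster = {}
    for lab, t in zip(labels, truth):
        by_cluster.setdefault(lab, Counter())[t] += 1
    return sum(c.most_common(1)[0][1] for c in by_cluster.values()) / len(labels)

docs = [["film", "actor"], ["movie", "actor"], ["movie", "film"],
        ["planet", "orbit"], ["star", "orbit"], ["star", "planet"]]
truth = ["cinema"] * 3 + ["astronomy"] * 3

points = [doc_vector(d, TOY_VECTORS) for d in docs]
labels = dbscan(points, eps=0.1, min_pts=2)
print(labels, purity(labels, truth))
```

Unlike K-means, DBSCAN does not take the number of clusters as input; it derives clusters from `eps` and `min_pts` and can mark documents as noise. In practice a library implementation and real pretrained vectors would replace the toy pieces above.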