Karwan Jacksi, Rowaida Kh. Ibrahim, Subhi R. M. Zeebaree, R. Zebari, M. A. Sadeeq
Published in: 2020 International Conference on Advanced Science and Engineering (ICOASE), 2020-12-23. DOI: 10.1109/ICOASE51841.2020.9436570 (https://doi.org/10.1109/ICOASE51841.2020.9436570)
Clustering Documents based on Semantic Similarity using HAC and K-Mean Algorithms
The continuing success of the Internet has greatly increased the number of text documents available in electronic form, and techniques for grouping these documents into meaningful collections have become mission-critical. Traditional methods group documents by statistical features and therefore rely on syntactic rather than semantic information. This article introduces a new method for grouping documents based on semantic similarity. Document summaries are identified from the Wikipedia and IMDB datasets and then processed using the NLTK dictionary. A vector space is then modeled with TF-IDF, and clustering is performed using the HAC and K-means algorithms. The results are compared and visualized as an interactive webpage.
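The pipeline outlined in the abstract (TF-IDF vector space, then HAC and K-means) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the four toy summaries are hypothetical, plain whitespace tokenization stands in for the NLTK preprocessing step, the smoothed IDF formula and average linkage are assumptions, and the K-means seeds are fixed by hand so the run is deterministic.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return one L2-normalised TF-IDF row (list of floats) per document."""
    tokenised = [d.lower().split() for d in docs]  # stand-in for NLTK preprocessing
    vocab = sorted({t for tokens in tokenised for t in tokens})
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        row = [tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0) for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, seeds, n_iter=20):
    """Lloyd's K-means; `seeds` are row indices used as initial centroids."""
    centroids = [list(rows[s]) for s in seeds]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every row.
        labels = [min(range(len(centroids)), key=lambda k: sq_dist(r, centroids[k]))
                  for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def hac(rows, n_clusters):
    """Agglomerative clustering with average linkage on Euclidean distance."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(math.sqrt(sq_dist(rows[i], rows[j])) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid
    labels = [0] * len(rows)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Hypothetical toy summaries: two science-fiction plots, two crime plots.
docs = [
    "space crew explores a distant planet",
    "astronaut crew stranded on a distant planet",
    "detective investigates a murder in the city",
    "detective hunts a killer in the city",
]
rows = tfidf_matrix(docs)
km_labels = kmeans(rows, seeds=[0, 2])
hac_labels = hac(rows, n_clusters=2)
print("K-means labels:", km_labels)
print("HAC labels:    ", hac_labels)
```

Because the rows are L2-normalised, Euclidean distance orders document pairs the same way cosine similarity does, so both algorithms effectively group documents by shared vocabulary. On real summaries, a library implementation with stemming and stop-word removal would likely be a better fit than this sketch.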