Cluster Correction on Polysemy and Synonymy
Zemin Qin, Hao Lian, Tieke He, B. Luo
2017 14th Web Information Systems and Applications Conference (WISA), 2017-11-01
DOI: https://doi.org/10.1109/WISA.2017.45
Citations: 4
Abstract
Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction, and fast information retrieval or filtering. Many challenges remain, however; in particular, clustering accuracy still needs to be improved. Cluster correction, the process of revising an initial clustering, is therefore the object of our analysis. In this paper, we focus on the polysemy and synonymy issues in the clustering process. Polysemy is the ambiguity of an individual word or phrase that can be used (in different contexts) to express two or more different meanings. Synonymy, by contrast, is the semantic relation that holds between two or more words that can (in a given context) express the same meaning. Both phenomena affect clustering results. To address them, we use a bag-of-words model to distinguish the contexts of the same word, and word2vec to re-cluster words with similar meanings. In both models, cosine similarity is used to measure the similarity between two nonzero vectors.
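The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method: it builds bag-of-words count vectors for the contexts of a polysemous word and compares them with cosine similarity. The example sentences, the whitespace tokenizer, and the function names `bag_of_words` and `cosine_similarity` are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical contexts for the polysemous word "bank" (illustrative only).
river_1 = bag_of_words("the river bank was muddy after the flood")
river_2 = bag_of_words("we walked along the muddy river bank")
money = bag_of_words("the bank approved the loan and opened an account")

same_sense = cosine_similarity(river_1, river_2)
diff_sense = cosine_similarity(river_1, money)

# Contexts sharing a sense should score higher than contexts that do not.
print(f"same sense: {same_sense:.3f}, different sense: {diff_sense:.3f}")
```

In this toy setup, the two "river" contexts share more vocabulary and so score higher than the "river" and "money" pair. The same `cosine_similarity` function could in principle be applied to dense word2vec vectors for the synonymy case, given a trained embedding model, which is not shown here.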