A new Chinese text clustering algorithm based on WRD and improved K-means

IF 0.8 | Zone 4 (Computer Science) | JCR Q4: Computer Science, Artificial Intelligence

Intelligent Data Analysis. Pub Date: 2023-06-01. DOI: 10.3233/ida-226652

Zicai Cui, Bocheng Zhong, Chen Bai

Pages: 1205-1220

Citations: 0


A new Chinese text clustering algorithm based on WRD and improved K-means

Text clustering is widely used in data mining, document management, search engines, and other fields, and K-means is one of its representative algorithms. However, traditional K-means usually measures the similarity between texts with Euclidean or cosine distance, which performs poorly on high-dimensional data and does not retain enough semantic information. To address these problems, we combine Word Rotator's Distance (WRD) with K-means and propose the WRDK-means algorithm, which uses WRD to compute the similarity between texts and thereby preserves more text features. Furthermore, we define a new cluster center initialization method that reduces the clustering instability caused by randomly selecting the initial cluster centers. To handle texts of inconsistent length, we also propose a new iterative approximation method for cluster centers. We selected three suitable datasets and five evaluation indicators to verify the feasibility of the proposed algorithm. The Rand index (RI) of our algorithm exceeds 90%, and on Macro_F1 our scheme is about 37.77%, 23.2%, 13.06%, and 20.12% better than the four other methods, respectively.
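To make the RI figure quoted in the abstract easier to interpret, here is a minimal sketch of the standard Rand index: the fraction of document pairs on which a clustering and a reference labeling agree, where a pair is either grouped together in both or separated in both. It is not taken from the paper; the function name `rand_index` and the toy labels are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if len(labels_true) < 2:
        raise ValueError("need at least two samples to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair counts as an agreement when both labelings put the two
        # samples together, or both keep them apart.
        agree += same_true == same_pred
        total += 1
    return agree / total


# Hypothetical toy data: four documents, two reference categories.
reference = ["sports", "sports", "finance", "finance"]
clusters = [1, 1, 0, 0]  # same grouping, different cluster ids
print(rand_index(reference, clusters))  # prints 1.0
```

Because the index compares pairs rather than label values, it is unaffected by how cluster ids are numbered, which is why it suits K-means output whose ids are arbitrary.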


Source journal

Intelligent Data Analysis (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 2.20

Self-citation rate: 5.90%

Review time: 3.3 months

About the journal: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, the journal prefers papers that discuss the development of new AI-related data analysis architectures, methodologies, and techniques and their applications to various domains.