Iterative clustering of high dimensional text data augmented by local search

2002 IEEE International Conference on Data Mining, 2002. Proceedings. Pub Date: 2002-12-09. DOI: 10.1109/ICDM.2002.1183895

I. Dhillon, Yuqiang Guan, J. Kogan


Citations: 155

Abstract

The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially when cluster sizes are small, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal solution. In this paper, we present a local search procedure, which we call "first-variation", that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. An enhancement of first-variation allows a chain of such moves in a Kernighan-Lin fashion and leads to a better local maximum. Combining the enhanced first-variation with spherical k-means yields a powerful "ping-pong" strategy that often qualitatively improves k-means clustering and is computationally efficient. We present several experimental results to highlight the improvement achieved by our proposed algorithm in clustering high-dimensional and sparse text data.
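The abstract does not spell out the procedure, so the following is only a minimal sketch of the single-move idea, not the authors' implementation. It assumes unit-length document vectors and uses the fact that, with concept vectors taken as normalized cluster sums, the spherical k-means objective equals the sum of the norms of the cluster sum vectors. The function names (`first_variation`, `best_single_move`, and so on), the tolerance, and the rule that a cluster is never emptied are illustrative choices. The Kernighan-Lin chain of moves and the alternation with spherical k-means are left out.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def normalize(v):
    n = norm(v)
    return [a / n for a in v]

def cluster_sums(points, labels, k):
    """Sum of the (unit-length) vectors assigned to each cluster."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    for x, lab in zip(points, labels):
        for d in range(dim):
            sums[lab][d] += x[d]
    return sums

def objective(points, labels, k):
    """Spherical k-means objective: sum over clusters of ||cluster sum||."""
    return sum(norm(s) for s in cluster_sums(points, labels, k))

def best_single_move(points, labels, k, tol=1e-12):
    """Return (gain, point index, target cluster) for the best single move.

    The point index is None when no move raises the objective by more than tol.
    """
    sums = cluster_sums(points, labels, k)
    norms = [norm(s) for s in sums]
    counts = [labels.count(j) for j in range(k)]
    best = (0.0, None, None)
    for i, x in enumerate(points):
        src = labels[i]
        if counts[src] <= 1:
            continue  # assumption of this sketch: never empty a cluster
        # Change in the source cluster's term when x leaves it.
        loss = norm([s - a for s, a in zip(sums[src], x)]) - norms[src]
        for dst in range(k):
            if dst == src:
                continue
            # Change in the target cluster's term when x joins it.
            gain = norm([s + a for s, a in zip(sums[dst], x)]) - norms[dst]
            if loss + gain > best[0] + tol:
                best = (loss + gain, i, dst)
    return best

def first_variation(points, labels, k, max_moves=100):
    """Apply improving single-point moves until none is left."""
    labels = list(labels)
    for _ in range(max_moves):
        _, i, dst = best_single_move(points, labels, k)
        if i is None:
            break
        labels[i] = dst
    return labels

# Toy usage: four 2-D "documents" with a deliberately poor starting partition.
docs = [normalize(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9])]
start = [0, 1, 0, 1]
refined = first_variation(docs, start, k=2)
print(objective(docs, start, 2), objective(docs, refined, 2), refined)
```

Each accepted move strictly increases the objective, so the loop stops at a partition that no single move can improve. In a ping-pong scheme of the kind the abstract describes, such a partition would then be handed back to spherical k-means.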



