O-Cluster: scalable clustering of large high dimensional data sets

2002 IEEE International Conference on Data Mining, 2002. Proceedings. Pub Date: 2002-12-09. DOI: 10.1109/ICDM.2002.1183915

B. Milenova, M. Campos

Citations: 82

Abstract

Clustering large data sets of high dimensionality has always been a challenge for clustering algorithms. Many recently developed clustering algorithms attempt to handle data sets with a very large number of records, a very high number of dimensions, or both. We discuss the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To overcome both the "curse of dimensionality" and the scalability problems associated with large amounts of data at the same time, we propose a new clustering algorithm called O-Cluster. O-Cluster combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.
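
The abstract does not spell out how the axis-parallel partitioning works, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that a partition is split along one dimension wherever an equal-width histogram shows a low-density valley between two denser regions. The function names (`histogram`, `find_valley`, `best_split`, `partition`), the bin count, and the `min_depth` and `min_points` thresholds are all hypothetical choices made for illustration. The active sampling step and the limited memory buffer are not modelled here.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def histogram(values: Sequence[float], bins: int) -> Tuple[List[int], float, float]:
    """Equal-width histogram; returns (counts, lower bound, bin width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts, lo, width


def find_valley(counts: Sequence[int], min_depth: float = 0.5) -> Optional[Tuple[int, float]]:
    """Return (bin index, depth) of the deepest valley between two peaks.

    Depth is 1 - count / (lower of the two surrounding peaks), so 1.0 means
    an empty bin between two populated regions. This is an assumed heuristic.
    """
    best: Optional[Tuple[int, float]] = None
    for i in range(1, len(counts) - 1):
        lower_peak = min(max(counts[:i]), max(counts[i + 1:]))
        if lower_peak == 0:
            continue
        depth = 1 - counts[i] / lower_peak
        if depth >= min_depth and (best is None or depth > best[1]):
            best = (i, depth)
    return best


def best_split(points: Sequence[Point], bins: int = 10,
               min_depth: float = 0.5) -> Optional[Tuple[int, float]]:
    """Pick the (dimension, cut value) with the deepest histogram valley."""
    best: Optional[Tuple[float, int, float]] = None  # (depth, dimension, cut)
    for dim in range(len(points[0])):
        values = [p[dim] for p in points]
        if min(values) == max(values):
            continue  # constant dimension, nothing to split
        counts, lo, width = histogram(values, bins)
        valley = find_valley(counts, min_depth)
        if valley is None:
            continue
        index, depth = valley
        cut = lo + (index + 0.5) * width  # centre of the valley bin
        if best is None or depth > best[0]:
            best = (depth, dim, cut)
    return None if best is None else (best[1], best[2])


def partition(points: Sequence[Point], bins: int = 10, min_depth: float = 0.5,
              min_points: int = 20) -> List[List[Point]]:
    """Recursively split with axis-parallel cuts; each leaf is one dense region."""
    if len(points) < min_points:
        return [list(points)]
    found = best_split(points, bins, min_depth)
    if found is None:
        return [list(points)]
    dim, cut = found
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return (partition(left, bins, min_depth, min_points)
            + partition(right, bins, min_depth, min_points))
```

Calling `partition` on a list of equal-length numeric tuples returns a list of leaves, each holding the points of one axis-aligned region. Because every cut is perpendicular to a single axis, each split only needs one-dimensional histograms, which is one plausible reason such a strategy suits high-dimensional data.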


