Parallelizing clustering of geoscientific data sets using data streams

Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004. Pub Date: 2004-06-21. DOI: 10.1109/SSDBM.2004.58

Silvia Nittel, Kelvin T. Leung

Citations: 11

Abstract

Running data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators that adapt to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as fits into RAM and performs a weighted k-means on that data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that, as data density per grid cell increases, the partial/merge k-means can outperform a one-step algorithm by a large margin in both overall computation time and clustering quality.
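The two-operator idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`weighted_kmeans`, `partial_operator`, `merge_operator`), the fixed `chunk_size` standing in for "as much data as fits into RAM", the farthest-first seeding, and the fixed iteration count are all assumptions made for illustration. The abstract does not specify these details.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]  # (centroid, total weight assigned to it)


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points: Sequence[Point], weights: Sequence[float],
                    k: int, iters: int = 20) -> List[Weighted]:
    """Lloyd-style k-means in which every point carries a weight."""
    k = min(k, len(points))
    # Assumption: deterministic farthest-first seeding (not from the paper).
    centroids: List[Point] = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    dim = len(points[0])
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Move each centroid to the weighted mean of its points;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return list(zip(centroids, totals))


def partial_operator(stream: Iterable[Point], chunk_size: int,
                     k: int) -> Iterator[List[Weighted]]:
    """Buffer up to chunk_size points, cluster them, emit weighted centroids."""
    buffer: List[Point] = []
    for p in stream:
        buffer.append(p)
        if len(buffer) == chunk_size:
            yield weighted_kmeans(buffer, [1.0] * len(buffer), k)
            buffer = []
    if buffer:  # flush the final, possibly smaller, chunk
        yield weighted_kmeans(buffer, [1.0] * len(buffer), k)


def merge_operator(partials: Iterable[List[Weighted]], k: int) -> List[Weighted]:
    """Cluster the weighted centroids produced by the partial operator(s)."""
    pts: List[Point] = []
    wts: List[float] = []
    for part in partials:
        for centroid, weight in part:
            pts.append(centroid)
            wts.append(weight)
    return weighted_kmeans(pts, wts, k) if pts else []


# Hypothetical toy stream: two interleaved, well-separated groups of points.
stream = [((i % 5) * 0.1 + (10.0 if i % 2 else 0.0),
           (i % 3) * 0.1 + (10.0 if i % 2 else 0.0)) for i in range(100)]
result = sorted(merge_operator(partial_operator(stream, chunk_size=25, k=2), k=2))
```

Because the partial operator is a generator over independent chunks, several instances could in principle run on different chunks in parallel and feed one merge step, which is the cloning property the abstract mentions. The sketch runs them sequentially for simplicity.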
