Efficient Distributed Data Clustering on Spark

2015 IEEE International Conference on Cluster Computing. Pub Date: 2015-09-08. DOI: 10.1109/CLUSTER.2015.84

Jia Li, Dongsheng Li, Yiming Zhang

Citations: 6

Abstract

Data clustering is usually time-consuming because it must iteratively aggregate and process large volumes of data. Approximate aggregation based on samples provides fast, quality-assured results. In this paper, we propose applying approximation techniques to data clustering to trade off clustering efficiency against result quality, together with online accuracy estimation. The proposed method is based on bootstrap trials. We implemented this method as an Intelligent Bootstrap Library (IBL) on Spark to support efficient data clustering. Extensive evaluations show that IBL can provide a 2x speed-up over the state-of-the-art solution with the same error bound.
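The idea of sample-based aggregation with a bootstrap error estimate can be sketched roughly as follows. This is only an illustration in plain Python, not IBL's API and not the authors' algorithm: the function names `bootstrap_mean_error` and `approximate_mean`, the percentile interval, and the sample-doubling schedule are all assumptions made for the example. It estimates a single mean (as one might for one coordinate of a cluster centroid) rather than running a full clustering job on Spark.

```python
import random
import statistics


def bootstrap_mean_error(sample, trials=200, confidence=0.95, seed=0):
    """Return the sample mean and a bootstrap half-width for its error.

    Hypothetical helper for illustration; not part of IBL.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimate = statistics.fmean(sample)
    # Each trial resamples the sample with replacement and recomputes the mean.
    trial_means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(trials)
    )
    # Percentile interval: cut off the tails outside the confidence level.
    tail = (1.0 - confidence) / 2.0
    lo = trial_means[int(tail * (trials - 1))]
    hi = trial_means[int((1.0 - tail) * (trials - 1))]
    return estimate, (hi - lo) / 2.0


def approximate_mean(data, error_bound, start=100, seed=0):
    """Grow a random sample until the bootstrap error meets error_bound.

    The doubling schedule is an assumption chosen for simplicity.
    """
    rng = random.Random(seed)
    size = min(start, len(data))
    while True:
        sample = rng.sample(data, size)
        estimate, error = bootstrap_mean_error(sample, seed=seed)
        # Stop when the estimated error is small enough, or when the
        # sample already covers the whole data set.
        if error <= error_bound or size == len(data):
            return estimate, error, size
        size = min(2 * size, len(data))
```

Calling `approximate_mean(data, error_bound=0.1)` on a list of numbers returns the estimate, its bootstrap error, and the sample size that was needed, which is the efficiency versus quality trade-off the abstract describes in miniature.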


