Mux-Kmeans: multiplex kmeans for clustering large-scale data set

Scientific Cloud Computing. Pub Date: 2014-06-23. DOI: 10.1145/2608029.2608033

Chen Li, Yanfeng Zhang, Ming-hai Jiao, Ge Yu


Citations: 9

Abstract

The Kmeans clustering algorithm is widely used in scientific applications because of its simple iterative nature and ease of implementation. The quality of the clustering result depends heavily on the selection of initial centroids: different selections of initial centroids lead to different clustering results. In practice, people run a series of Kmeans processes serially, each with a different group of initial centroids, and return the best clustering result among them. In the era of big data, however, Kmeans is implemented on MapReduce to scale to large data sets, and even a single Kmeans process on MapReduce has a considerably long runtime. This paper proposes Mux-Kmeans. Rather than executing multiple Kmeans processes serially, Mux-Kmeans launches them concurrently with multiple centroid groups. In each iteration, Mux-Kmeans (i) evaluates these Kmeans processes, (ii) prunes the low-quality Kmeans processes, and (iii) incubates new Kmeans processes. After a certain number of iterations, it returns the best of the resulting locally optimal results. We implement Mux-Kmeans on MapReduce and evaluate it on Amazon EC2. The experimental results show that, starting from the same initial centroid groups, the clustering result of Mux-Kmeans is never worse than the best result of a series of Kmeans processes. Mux-Kmeans also takes less elapsed time than running multiple Kmeans processes serially.
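The evaluate, prune, and incubate loop described in the abstract can be illustrated with a small single-machine sketch. The abstract does not say how processes are scored, how many are pruned, or how new ones are incubated, so everything below is an assumption made for illustration: processes are scored by the sum of squared distances to the nearest centroid, a fixed fraction `keep` survives each round, and a new process is created by copying a surviving centroid group and replacing one centroid with a random data point. The names `mux_kmeans`, `lloyd_step`, `sse`, and `keep` are hypothetical, and no MapReduce is involved.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points, centroids):
    """Assumed quality score: sum of squared distances to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def lloyd_step(points, centroids):
    """One standard Kmeans iteration: assign points, then recompute the means."""
    k, dim = len(centroids), len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # An empty cluster keeps its previous centroid.
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centroids[j])
        for j in range(k)
    ]


def mux_kmeans(points, groups, iterations=10, keep=0.5, seed=0):
    """Advance several Kmeans processes together (hypothetical sketch)."""
    rng = random.Random(seed)
    n_groups = len(groups)
    n_keep = max(1, int(n_groups * keep))
    groups = [[tuple(c) for c in g] for g in groups]
    for _ in range(iterations):
        # Advance every process by one Kmeans iteration.
        groups = [lloyd_step(points, g) for g in groups]
        # (i) Evaluate: order the processes by the assumed quality score.
        groups.sort(key=lambda g: sse(points, g))
        # (ii) Prune: keep only the best-scoring fraction.
        survivors = groups[:n_keep]
        # (iii) Incubate: assumed rule, perturb a copy of a survivor.
        newborn = []
        while len(survivors) + len(newborn) < n_groups:
            child = list(rng.choice(survivors))
            child[rng.randrange(len(child))] = tuple(rng.choice(points))
            newborn.append(child)
        groups = survivors + newborn
    best = min(groups, key=lambda g: sse(points, g))
    return best, sse(points, best)
```

Calling `mux_kmeans(points, groups)` with a list of point tuples and a list of initial centroid groups returns the best centroid group found and its score. In this sketch the best process always survives pruning, so the returned score cannot exceed the best score among the initial groups; whether the paper's pruning and incubation rules behave the same way is not stated in the abstract.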


