Mux-Kmeans: multiplex kmeans for clustering large-scale data set

Chen Li, Yanfeng Zhang, Ming-hai Jiao, Ge Yu
Scientific Cloud Computing, published 2014-06-23
DOI: 10.1145/2608029.2608033 (https://doi.org/10.1145/2608029.2608033)
Citations: 9

Abstract

The Kmeans clustering algorithm is widely used in scientific applications because of its simple iterative nature and ease of implementation. The quality of the clustering result depends heavily on the selection of initial centroids: different selections lead to different clustering results. In practice, people run a series of Kmeans processes serially, each with a different initial centroid group, and return the best clustering result among them. In the era of big data, however, Kmeans is implemented on MapReduce to scale to large data sets, and even a single Kmeans process on MapReduce requires a considerably long runtime. This paper proposes Mux-Kmeans. Rather than executing multiple Kmeans processes serially, Mux-Kmeans launches them concurrently with multiple centroid groups. In each iteration, Mux-Kmeans (i) evaluates these Kmeans processes, (ii) prunes the low-quality Kmeans processes, and (iii) incubates new Kmeans processes. After a certain number of iterations, it returns the best among these locally optimal results. We implement Mux-Kmeans on MapReduce and evaluate it on Amazon EC2. The experimental results show that, starting from the same initial centroid groups, the clustering result of Mux-Kmeans is never worse than the best result of a series of Kmeans processes. Mux-Kmeans also takes less elapsed time than running multiple Kmeans processes serially.
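
The evaluate, prune, and incubate loop described in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not say how quality is measured, how many processes are pruned, or how new ones are created, so everything specific here is an assumption made for illustration: quality is the sum of squared errors (SSE), the `n_prune` worst groups are dropped each iteration, and a new group is incubated by copying the best surviving group and replacing one centroid with a random data point. The names `mux_kmeans_sketch`, `kmeans_step`, `sse`, and `n_prune` are hypothetical, and a plain loop stands in for concurrent execution on MapReduce.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans_step(points, centroids):
    """One Kmeans iteration: assign points, then recompute centroids as means."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        clusters[nearest].append(p)
    # An empty cluster keeps its previous centroid.
    return [
        tuple(sum(xs) / len(members) for xs in zip(*members)) if members else c
        for members, c in zip(clusters, centroids)
    ]


def mux_kmeans_sketch(points, groups, iterations=10, n_prune=1, seed=0):
    """Advance several centroid groups together; return (best_group, best_sse)."""
    rng = random.Random(seed)
    groups = [list(g) for g in groups]
    n_prune = min(n_prune, len(groups) - 1)  # always keep at least one group
    best = min(groups, key=lambda g: sse(points, g))
    for _ in range(iterations):
        # Advance every group by one Kmeans step (a loop here, not real concurrency).
        groups = [kmeans_step(points, g) for g in groups]
        # (i) Evaluate: rank groups by SSE (assumed quality metric).
        groups.sort(key=lambda g: sse(points, g))
        if sse(points, groups[0]) < sse(points, best):
            best = groups[0]
        # (ii) Prune: drop the n_prune worst groups (assumed policy).
        survivors = groups[: len(groups) - n_prune]
        # (iii) Incubate: perturb a copy of the best survivor (assumed policy).
        newborn = []
        for _ in range(n_prune):
            child = list(survivors[0])
            child[rng.randrange(len(child))] = rng.choice(points)
            newborn.append(child)
        groups = survivors + newborn
    return best, sse(points, best)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    init = [[(0.0, 0.0), (0.0, 1.0)], [(0.0, 0.0), (10.0, 10.0)], [(10.0, 11.0), (11.0, 10.0)]]
    group, score = mux_kmeans_sketch(pts, init, iterations=5)
    print(group, score)
```

Because a Kmeans step never increases SSE and the sketch only replaces `best` with a strictly better group, the returned SSE is no larger than the best SSE among the initial groups. This property belongs to the sketch only; it is not a reproduction of the paper's experimental claim about serial Kmeans runs.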