一种用于处理大数据的协同分治k均值聚类算法

Proceedings of the 11th ACM Conference on Computing Frontiers Pub Date : 2014-05-20 DOI:10.1145/2597917.2597918

Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng

{"title":"一种用于处理大数据的协同分治k均值聚类算法","authors":"Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng","doi":"10.1145/2597917.2597918","DOIUrl":null,"url":null,"abstract":"K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"7 3-4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"A collaborative divide-and-conquer K-means clustering algorithm for processing large data\",\"authors\":\"Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng\",\"doi\":\"10.1145/2597917.2597918\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.\",\"PeriodicalId\":194910,\"journal\":{\"name\":\"Proceedings of the 11th ACM Conference on Computing Frontiers\",\"volume\":\"7 3-4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 11th ACM Conference on Computing Frontiers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2597917.2597918\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2597917.2597918","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 16

摘要

k均值聚类在数据挖掘中起着至关重要的作用。作为一种迭代计算，由于迭代过程中的时间局部性差，当应用于大量数据时，其性能将受到影响。最先进的流算法将数据从磁盘流到内存，并对分区流进行操作，提高了时间局部性，但由于不同的分区是在本地处理的，因此可能会在集群中放置错误的对象。本文提出了一种基于两个关键见解的协作分而治之算法，以显着提高最先进的技术。首先，我们引入了一个中断-重新聚类过程来识别对象放错位置的聚类。其次，在不同分区之间引入协同播种，加快各分区内部的收敛速度;与使用大量维基百科网页作为数据集的流式算法相比，我们的协同算法的聚类质量提高了35.3%，平均提高了8.8%，执行时间从0.3%降低到80.1%，平均降低了48.6%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A collaborative divide-and-conquer K-means clustering algorithm for processing large data

K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 11th ACM Conference on Computing Frontiers

自引率

0.00%

发文量