Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng
{"title":"一种用于处理大数据的协同分治k均值聚类算法","authors":"Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng","doi":"10.1145/2597917.2597918","DOIUrl":null,"url":null,"abstract":"K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"7 3-4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"A collaborative divide-and-conquer K-means clustering algorithm for processing large data\",\"authors\":\"Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng\",\"doi\":\"10.1145/2597917.2597918\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.\",\"PeriodicalId\":194910,\"journal\":{\"name\":\"Proceedings of the 11th ACM Conference on Computing Frontiers\",\"volume\":\"7 3-4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 11th ACM Conference on Computing Frontiers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2597917.2597918\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2597917.2597918","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A collaborative divide-and-conquer K-means clustering algorithm for processing large data
K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.