Haocheng Wang, Fuzhen Zhuang, Xiang Ao, Qing He, Zhongzhi Shi
Scalable bootstrap clustering for massive data
15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)
Published: 2014-09-01
DOI: https://doi.org/10.1109/SNPD.2014.6888693
Citations: 2
Abstract
The bootstrap provides a simple and powerful means of improving the accuracy of clustering. However, for today's increasingly large datasets, computing bootstrap-based quantities can be prohibitively demanding. In this paper we introduce Bag of Little Bootstraps Clustering (BLBC), a new procedure that uses the Bag of Little Bootstraps technique to obtain a robust, computationally efficient means of clustering massive data. Moreover, BLBC is well suited to implementation on the modern parallel and distributed computing architectures that are often used to process large datasets. We empirically investigate the performance characteristics of BLBC and compare it with existing methods in experiments on simulated and real data. The results show that BLBC has a significantly more favorable computational profile than bootstrap-based clustering while maintaining good statistical correctness.
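The abstract does not spell out the steps of BLBC, so the following is only a minimal sketch of the generic Bag of Little Bootstraps idea it builds on, applied to a much simpler statistic (the standard error of a sample mean) rather than to clustering. The function name `blb_stderr_of_mean` and the parameters `gamma`, `n_subsets`, and `n_resamples` are assumptions chosen for illustration and do not come from the paper.

```python
import random
import statistics
from collections import Counter


def blb_stderr_of_mean(data, gamma=0.7, n_subsets=5, n_resamples=50, seed=0):
    """Estimate the standard error of the sample mean with a BLB-style scheme.

    Illustrative only: this is not the BLBC procedure from the paper.
    """
    rng = random.Random(seed)
    n = len(data)
    # Each subset holds b = n**gamma points, far fewer than n for large n.
    b = max(2, int(n ** gamma))
    per_subset = []
    for _ in range(n_subsets):
        # Draw a small subset without replacement.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(n_resamples):
            # A resample of nominal size n is stored as counts over only
            # b distinct points, so the estimator touches b values, not n.
            counts = Counter(rng.choices(range(b), k=n))
            weighted_sum = sum(subset[i] * c for i, c in counts.items())
            estimates.append(weighted_sum / n)
        # Quality assessment for this subset: spread of the resampled means.
        per_subset.append(statistics.stdev(estimates))
    # Average the per-subset assessments to get the final estimate.
    return statistics.fmean(per_subset)


if __name__ == "__main__":
    gen = random.Random(1)
    sample = [gen.gauss(0.0, 1.0) for _ in range(2000)]
    print(blb_stderr_of_mean(sample))
```

Because each subset is processed independently, the outer loop is the natural unit to distribute across workers, which is the property the abstract highlights for parallel and distributed architectures. A clustering variant would presumably replace the weighted mean with a clustering routine that accepts per-point weights, but that step is not described in the abstract and is left out here.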