Research on and Improvement of the K-means Algorithm Based on Hadoop

2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS) Pub Date : 2015-11-30 DOI:10.1109/ICSESS.2015.7339068

Kehe Wu, Wenjing Zeng, Tingting Wu, Yanwen An


Citations: 5


Research and improve on K-means algorithm based on hadoop

With the advent of the big data era, traditional data mining algorithms are no longer adequate for analyzing, managing, and mining massive data. The development of cloud computing has given new life to algorithm parallelization. In this paper, we study K-means, a clustering algorithm, and attempt to improve it by sampling the large-scale data and using the convex hull and antipodal points of the sample to determine the initial two cluster centers. We also use the MapReduce programming model to parallelize the whole process. Finally, using the Reuters-21578 news collection as the data source, we run comparative experiments across different distance measures, serial versus parallel execution, and different numbers of cluster nodes to verify the efficiency of the improved algorithm. The results show that, compared with the serial algorithm, the improved parallel algorithm is clearly more reliable and more efficient as the number of cluster nodes and the data size increase.
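The abstract gives only an outline of the method, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes 2-D points (the paper's text data would be high-dimensional), picks the two initial centers as the farthest-apart vertices of the sample's convex hull, and simulates one MapReduce-style K-means pass locally instead of on Hadoop. All function names are hypothetical, and the brute-force search over hull vertices stands in for a proper antipodal-pair (rotating calipers) scan.

```python
from collections import defaultdict
from itertools import combinations
import math
import random


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def initial_two_centers(data, sample_size, seed=0):
    """Sample the data, then return the two hull vertices that are farthest apart.

    Assumes the sample contains at least two distinct points.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    hull = convex_hull(sample)
    # Brute force over hull vertices; rotating calipers would visit only antipodal pairs.
    return max(combinations(hull, 2), key=lambda pair: math.dist(*pair))


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def kmeans_reduce(idx, points):
    """Reduce step: the new center is the mean of the points assigned to idx."""
    n = len(points)
    return idx, tuple(sum(coord) / n for coord in zip(*points))


def kmeans_iteration(data, centers):
    """One MapReduce-style pass, run locally: map, group by key, reduce."""
    groups = defaultdict(list)
    for p in data:
        idx, point = kmeans_map(p, centers)
        groups[idx].append(point)
    new_centers = list(centers)  # a center with no assigned points is left unchanged
    for idx, pts in groups.items():
        new_centers[idx] = kmeans_reduce(idx, pts)[1]
    return new_centers


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (0, 1), (1, 1), (9, 9), (10, 9), (9, 10), (10, 10)]
    centers = initial_two_centers(data, sample_size=len(data))
    print(kmeans_iteration(data, list(centers)))
```

On a real cluster the map and reduce functions would run as separate tasks with the current centers distributed to every mapper; the local grouping dictionary here only imitates the shuffle phase.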

