Research and improve on K-means algorithm based on hadoop
Authors: Kehe Wu, Wenjing Zeng, Tingting Wu, Yanwen An
Published in: 2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS), 2015-11-30
DOI: 10.1109/ICSESS.2015.7339068 (https://doi.org/10.1109/ICSESS.2015.7339068)
With the advent of the big data era, traditional data mining algorithms can no longer cope with the analysis, management, and mining of massive data sets. The development of cloud computing has given new life to algorithm parallelization. In this paper, we study K-means, a clustering algorithm. We then attempt to improve it by sampling the large-scale data and using the convex hull and antipodal points to determine the initial two cluster centers. We also use the MapReduce programming model to parallelize the whole process. Finally, using the Reuters-21578 news collection as the data source, we run comparative experiments across different distance measures, serial versus parallel execution, and different numbers of cluster nodes to verify the efficiency of the improved algorithm. The results show that, compared with the serial algorithm, the improved parallel algorithm is clearly better in both reliability and efficiency as the number of cluster nodes and the data size increase.
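The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, under several assumptions: points are 2-D tuples (the paper works with text data), the two initial centers are taken to be the farthest pair of points in a random sample, and the MapReduce steps are simulated with plain Python functions rather than Hadoop. The farthest pair of a point set always lies on its convex hull, so only hull vertices need to be compared; the sketch compares them by brute force where a rotating-calipers scan over antipodal pairs would be faster. All function names (`convex_hull`, `initial_two_centers`, `map_assign`, `reduce_mean`, `kmeans_iteration`) are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict
from itertools import combinations


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def initial_two_centers(points, sample_size, seed=0):
    """Sample the data, then return the farthest pair of hull vertices.

    Assumption: the sample contains at least two distinct points.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    hull = convex_hull(sample)
    # Brute force over hull vertices only; the hull is usually far smaller
    # than the sample, so this stays cheap for illustration purposes.
    return max(combinations(hull, 2), key=lambda pair: sq_dist(*pair))


def map_assign(point, centers):
    """Map step: emit (index of nearest center, partial sum with count 1)."""
    idx = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
    return idx, (point[0], point[1], 1)


def reduce_mean(values):
    """Reduce step: average all partial sums that share one center index."""
    sx = sum(v[0] for v in values)
    sy = sum(v[1] for v in values)
    n = sum(v[2] for v in values)
    return (sx / n, sy / n)


def kmeans_iteration(points, centers):
    """One K-means pass: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centers)
        groups[key].append(value)
    # A center that attracted no points is left where it was.
    return [reduce_mean(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))]
```

A caller would compute `centers = list(initial_two_centers(points, sample_size))` once and then apply `kmeans_iteration(points, centers)` repeatedly until the centers stop moving. In a real Hadoop job the map and reduce functions would run on separate nodes and the grouping would be done by the framework's shuffle phase.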