Analysis and performance improvement of K-means clustering in big data environment

2015 International Conference on Communication Networks (ICCN) Pub Date : 2015-11-01 DOI:10.1109/ICCN.2015.9

Purva Rathore, Deepak Shukla


Citations: 6

Abstract

Big data environments support the processing of very large volumes of data, on the order of gigabytes or terabytes. Online applications that generate very large numbers of data requests, such as Facebook and Google, are therefore served using big data technologies. This work studies the big data environment, investigating how data is consumed and how the supporting tools work with Hadoop storage. For a deeper investigation, a cluster analysis technique, specifically the k-means clustering algorithm, is implemented using Hadoop and MapReduce. Clustering is a part of big data analytics in which unlabelled data is processed to form groups. The traditional k-means algorithm is observed not to work well with Hadoop and MapReduce, so a small modification is made to the data processing technique. In addition, cluster analysis reveals several issues in traditional k-means, namely fluctuating accuracy, outliers, and empty clusters. A new clustering algorithm that modifies the traditional k-means approach is therefore proposed and implemented. The approach first improves data quality by removing outlier points from the dataset and then uses the bi-part method to perform the clustering. The proposed technique is implemented using Java, Hadoop, and MapReduce, and its performance is evaluated and compared with that of the traditional k-means algorithm. The results show improved accuracy of cluster formation and removal of the identified deficiencies. The proposed work is thus suitable for the big data environment and improves clustering performance.
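For readers unfamiliar with how k-means maps onto MapReduce, the following is a minimal sketch of one traditional k-means iteration split into a map-like assignment step and a reduce-like averaging step. It is plain Java with no Hadoop dependency and is not the authors' implementation. The class name, method names, and sample data are hypothetical, and keeping the old centroid for an empty cluster is an assumption of this sketch. The abstract does not specify the outlier criterion or the details of the bi-part method, so neither is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: one k-means iteration written as a map phase and a reduce phase. */
class KMeansMapReduceSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** "Map" step: return the index of the centroid nearest to the point. */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = squaredDistance(point, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** "Reduce" step: average all points grouped under one centroid index. */
    static double[] mean(List<double[]> points) {
        double[] m = new double[points.get(0).length];
        for (double[] p : points) {
            for (int i = 0; i < m.length; i++) {
                m[i] += p[i];
            }
        }
        for (int i = 0; i < m.length; i++) {
            m[i] /= points.size();
        }
        return m;
    }

    /** One iteration: group points by nearest centroid, then recompute each centroid. */
    static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            int key = nearestCentroid(p, centroids);
            groups.computeIfAbsent(key, idx -> new ArrayList<>()).add(p);
        }
        double[][] updated = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            List<double[]> members = groups.get(c);
            // Assumption of this sketch: an empty cluster keeps its previous centroid.
            updated[c] = (members == null) ? centroids[c].clone() : mean(members);
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
    }
}
```

In an actual Hadoop job, the assignment step would typically run in a Mapper that emits (centroid index, point) pairs and the averaging step in a Reducer, with the driver repeating the job until the centroids stop moving. The empty-cluster branch marks the spot where traditional k-means shows one of the issues the abstract mentions.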



