A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data

K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind
{"title":"A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data","authors":"K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind","doi":"10.1109/Trustcom/BigDataSE/ICESS.2017.332","DOIUrl":null,"url":null,"abstract":"With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.","PeriodicalId":170253,"journal":{"name":"2017 IEEE Trustcom/BigDataSE/ICESS","volume":"17 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Trustcom/BigDataSE/ICESS","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.332","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30

Abstract

With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.
大数据快速k均值聚类的数据科学与工程解决方案
随着技术的进步,在当前的大数据时代,可以很容易地收集或高速生成大量、种类繁多、不同准确性的有价值数据。在这些大数据中嵌入了隐含的、以前未知的、潜在有用的信息。因此,需要从这些大数据中挖掘和发现知识的快速、可扩展的大数据科学和工程解决方案。一个流行且实用的数据挖掘任务是将相似的数据分组到集群中(即聚类)。为了对非常大的数据或大数据进行聚类,基于k-means的算法已经被广泛使用。虽然许多现有的k-means算法给出了高质量的结果,但它们也存在一些问题。例如,随机选择k个质心存在风险,可能会产生大致相等的圆形簇,并且运行时复杂性非常高。为了解决这些问题,本文提出了一种基于启发式原型算法的大数据科学与工程解决方案。评估结果表明了该方案的有效性和可扩展性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信