Speedup of the k-Means Algorithm for Partitioning Large Datasets of Flat Points by a Preliminary Partition and Selecting Initial Centroids

IF 0.5 · Q4 · COMPUTER SCIENCE, THEORY & METHODS
V. Romanuke
{"title":"Speedup of the k-Means Algorithm for Partitioning Large Datasets of Flat Points by a Preliminary Partition and Selecting Initial Centroids","authors":"V. Romanuke","doi":"10.2478/acss-2023-0001","DOIUrl":null,"url":null,"abstract":"Abstract A problem of partitioning large datasets of flat points is considered. Known as the centroid-based clustering problem, it is mainly addressed by the k-means algorithm and its modifications. As the k-means performance becomes poorer on large datasets, including the dataset shape stretching, the goal is to study a possibility of improving the centroid-based clustering for such cases. It is quite noticeable on non-sparse datasets that the resulting clusters produced by k-means resemble beehive honeycomb. It is natural for rectangular-shaped datasets because the hexagonal cells make efficient use of space owing to which the sum of the within-cluster squared Euclidean distances to the centroids is approximated to its minimum. Therefore, the lattices of rectangular and hexagonal clusters, consisting of stretched rectangles and regular hexagons, are suggested to be successively applied. Then the initial centroids are calculated by averaging within respective hexagons. These centroids are used as initial seeds to start the k-means algorithm. This ensures faster and more accurate convergence, where at least the expected speedup is 1.7 to 2.1 times by a 0.7 to 0.9 % accuracy gain. The lattice of rectangular clusters applied first makes rather rough but effective partition allowing to optionally run further clustering on parallel processor cores. The lattice of hexagonal clusters applied to every rectangle allows obtaining initial centroids very quickly. Such centroids are far closer to the solution than the initial centroids in the k-means++ algorithm. Another approach to the k-means update, where initial centroids are selected separately within every rectangle hexagons, can be used as well. It is faster than selecting initial centroids across all hexagons but is less accurate. The speedup is 9 to 11 times by a possible accuracy loss of 0.3 %. However, this approach may outperform the k-means algorithm. The speedup increases as both the lattices become denser and the dataset becomes larger reaching 30 to 50 times.","PeriodicalId":41960,"journal":{"name":"Applied Computer Systems","volume":null,"pages":null},"PeriodicalIF":0.5000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Computer Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2478/acss-2023-0001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0

Abstract

A problem of partitioning large datasets of flat points is considered. Known as the centroid-based clustering problem, it is mainly addressed by the k-means algorithm and its modifications. Since k-means performance degrades on large datasets, particularly when the dataset shape is stretched, the goal is to study whether centroid-based clustering can be improved for such cases. On non-sparse datasets it is quite noticeable that the clusters produced by k-means resemble a beehive honeycomb. This is natural for rectangular-shaped datasets, because hexagonal cells make efficient use of space, so the sum of within-cluster squared Euclidean distances to the centroids comes close to its minimum. Therefore, lattices of rectangular and hexagonal clusters, consisting of stretched rectangles and regular hexagons, are suggested to be applied successively. The initial centroids are then calculated by averaging within the respective hexagons and are used as seeds to start the k-means algorithm. This ensures faster and more accurate convergence: the expected speedup is at least 1.7 to 2.1 times, with an accuracy gain of 0.7 to 0.9 %. The lattice of rectangular clusters, applied first, produces a rather rough but effective partition that optionally allows further clustering to run on parallel processor cores. The lattice of hexagonal clusters applied to every rectangle allows the initial centroids to be obtained very quickly. Such centroids are far closer to the solution than the initial centroids of the k-means++ algorithm. Another approach to the k-means update, in which initial centroids are selected separately within every rectangle's hexagons, can be used as well. It is faster than selecting initial centroids across all hexagons, but is less accurate: the speedup is 9 to 11 times, with a possible accuracy loss of 0.3 %. Even so, this approach may outperform the k-means algorithm. The speedup grows as both lattices become denser and the dataset becomes larger, reaching 30 to 50 times.
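As an illustration of the seeding idea described above, the following Python sketch bins 2-D points into a hexagonal lattice, averages the points within each hexagon, and passes those averages to scikit-learn's KMeans as initial centroids. It is a minimal sketch, not the author's implementation: the pointy-top lattice geometry, the cell size `hex_size`, and the choice of the k most populated hexagons as seeds are illustrative assumptions, and the rectangular preliminary partition is deferred to the next sketch.

```python
# Hexagonal-lattice seeding of k-means: a minimal sketch of the idea from the
# abstract, under assumptions noted above (not the author's reference code).
import numpy as np
from sklearn.cluster import KMeans


def hex_cell(points: np.ndarray, hex_size: float) -> np.ndarray:
    """Map 2-D points to axial (q, r) indices of a pointy-top hexagonal lattice."""
    x, y = points[:, 0] / hex_size, points[:, 1] / hex_size
    q = np.sqrt(3.0) / 3.0 * x - y / 3.0
    r = 2.0 * y / 3.0
    # Cube rounding assigns every point to exactly one hexagon.
    rq, rr, rs = np.round(q), np.round(r), np.round(-q - r)
    dq, dr, ds = np.abs(rq - q), np.abs(rr - r), np.abs(rs + q + r)
    fix_q = (dq > dr) & (dq > ds)
    fix_r = ~fix_q & (dr > ds)
    rq[fix_q] = -rr[fix_q] - rs[fix_q]
    rr[fix_r] = -rq[fix_r] - rs[fix_r]
    return np.stack([rq, rr], axis=1).astype(int)


def hex_seeded_kmeans(points: np.ndarray, k: int, hex_size: float) -> KMeans:
    """Average points within each hexagon and seed k-means with the k densest cells."""
    cells = hex_cell(points, hex_size)
    uniq, inverse, counts = np.unique(cells, axis=0,
                                      return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    if len(uniq) < k:
        raise ValueError("hexagonal lattice too coarse: fewer hexagons than clusters")
    sums = np.zeros((len(uniq), 2))
    np.add.at(sums, inverse, points)      # per-hexagon coordinate sums
    means = sums / counts[:, None]        # per-hexagon averages = candidate centroids
    seeds = means[np.argsort(counts)[::-1][:k]]   # k most populated hexagons (illustrative)
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit(points)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stretched rectangular dataset, matching the "flat points" setting.
    data = rng.uniform(size=(200_000, 2)) * np.array([8.0, 2.0])
    model = hex_seeded_kmeans(data, k=50, hex_size=0.3)
    print(round(model.inertia_, 2), model.n_iter_)
```

With an explicit array of seeds, scikit-learn skips its k-means++ initialisation, so any speedup in this sketch comes entirely from how close the lattice-based centroids already are to the solution.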
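The rectangular preliminary partition can be sketched in the same spirit: the bounding box is cut into a lattice of stretched rectangles, and each rectangle is clustered independently, which is what makes the optional use of parallel processor cores possible. The lattice shape (`rect_x`, `rect_y`) and the per-rectangle cluster count below are illustrative assumptions; in the scheme described above, the hexagon-seeded k-means of the previous sketch, rather than the plain KMeans used here, would run inside each rectangle.

```python
# Rough sketch of the rectangular preliminary partition with per-rectangle
# clustering on parallel cores (illustrative assumptions, not the paper's code).
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans


def rectangle_ids(points: np.ndarray, rect_x: int, rect_y: int) -> np.ndarray:
    """Assign each 2-D point to a cell of a rect_x-by-rect_y lattice of stretched rectangles."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.maximum(hi - lo, 1e-12)                    # guard against zero extent
    ix = np.minimum((points[:, 0] - lo[0]) / span[0] * rect_x, rect_x - 1).astype(int)
    iy = np.minimum((points[:, 1] - lo[1]) / span[1] * rect_y, rect_y - 1).astype(int)
    return ix * rect_y + iy


def cluster_rectangle(subset: np.ndarray, k_local: int) -> np.ndarray:
    """Cluster the points of one rectangle and return its centroids."""
    k = min(k_local, len(subset))
    return KMeans(n_clusters=k, n_init=1).fit(subset).cluster_centers_


def partitioned_kmeans(points: np.ndarray, k_total: int,
                       rect_x: int = 4, rect_y: int = 2) -> np.ndarray:
    """Cluster every rectangle independently (in parallel) and pool the centroids."""
    rects = rectangle_ids(points, rect_x, rect_y)
    k_local = max(1, k_total // (rect_x * rect_y))       # clusters per rectangle
    subsets = [points[rects == r] for r in np.unique(rects)]
    centers = Parallel(n_jobs=-1)(delayed(cluster_rectangle)(s, k_local) for s in subsets)
    return np.vstack(centers)                            # roughly k_total pooled centroids


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.uniform(size=(200_000, 2)) * np.array([8.0, 2.0])
    print(partitioned_kmeans(pts, k_total=48).shape)     # (48, 2) with the defaults
```

Pooling `k_local` centroids from each rectangle yields roughly `k_total` centroids in total; the count matches exactly only when `k_total` divides evenly by the number of rectangles.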