{"title":"使用投影数据更快更好地聚类","authors":"Alibek Zhakubayev, Greg Hamerly","doi":"10.1145/3546157.3546158","DOIUrl":null,"url":null,"abstract":"The K-means clustering algorithm can take a lot of time to converge, especially for large datasets in high dimension and a large number of clusters. By applying several enhancements it is possible to improve the performance without significantly changing the quality of the clustering. In this paper we first find a good clustering in a reduced-dimension version of the dataset, before fine-tuning the clustering in the original dimension. This saves time because accelerated K-means algorithms are fastest in low dimension, and the initial low-dimensional clustering bring us close to a good solution for the original data. We use random projection to reduce the dimension, as it is fast and maintains the cluster properties we want to preserve. In our experiments, we see that this approach significantly reduces the time needed for clustering a dataset and in most cases produces better results.","PeriodicalId":422215,"journal":{"name":"Proceedings of the 6th International Conference on Information System and Data Mining","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Clustering Faster and Better with Projected Data\",\"authors\":\"Alibek Zhakubayev, Greg Hamerly\",\"doi\":\"10.1145/3546157.3546158\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The K-means clustering algorithm can take a lot of time to converge, especially for large datasets in high dimension and a large number of clusters. By applying several enhancements it is possible to improve the performance without significantly changing the quality of the clustering. In this paper we first find a good clustering in a reduced-dimension version of the dataset, before fine-tuning the clustering in the original dimension. This saves time because accelerated K-means algorithms are fastest in low dimension, and the initial low-dimensional clustering bring us close to a good solution for the original data. We use random projection to reduce the dimension, as it is fast and maintains the cluster properties we want to preserve. 
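The abstract describes a two-phase procedure: cluster a randomly projected copy of the data first, then use that result to initialize K-means in the original space. Below is a minimal sketch of that idea, assuming scikit-learn's GaussianRandomProjection and KMeans as stand-ins for the accelerated K-means implementations the paper evaluates; the function name two_phase_kmeans and the projected dimension are illustrative choices, not the authors' code.

```python
# Sketch of the two-phase idea from the abstract: cluster the projected
# data first (cheap, low dimension), then refine in the original space
# starting from the centroids implied by the low-dimensional clustering.
# Uses scikit-learn as a stand-in; parameters here are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection


def two_phase_kmeans(X, n_clusters, proj_dim=20, random_state=0):
    # Phase 1: random projection, then K-means in the low-dimensional space.
    rp = GaussianRandomProjection(n_components=proj_dim, random_state=random_state)
    X_low = rp.fit_transform(X)
    km_low = KMeans(n_clusters=n_clusters, n_init=1, random_state=random_state)
    labels_low = km_low.fit_predict(X_low)

    # Lift the result back: compute centroids in the original space
    # from the low-dimensional cluster assignments.
    init_centers = np.vstack(
        [X[labels_low == k].mean(axis=0) for k in range(n_clusters)]
    )

    # Phase 2: fine-tune in the original dimension, starting from those centroids.
    km_full = KMeans(
        n_clusters=n_clusters, init=init_centers, n_init=1, random_state=random_state
    )
    return km_full.fit(X)


if __name__ == "__main__":
    # Example usage on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 200))
    model = two_phase_kmeans(X, n_clusters=50)
    print(model.inertia_)
```

Random projection is the natural choice for the first phase because it is cheap to apply and approximately preserves pairwise distances (in the Johnson-Lindenstrauss sense), so the projected clustering tends to be a good starting point for refinement in the original dimension.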