Faster Parallel Exact Density Peaks Clustering (Abstract)

Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing. Pub Date: 2023-05-18. DOI: 10.1145/3597635.3598021

Yihao Huang, Shangdi Yu, Julian Shun



Abstract



Clustering multidimensional points is a fundamental data mining task, with applications in many fields, such as astronomy, neuroscience, bioinformatics, and computer vision. The goal of clustering algorithms is to group similar objects together. Density-based clustering is a clustering approach that defines clusters as dense regions of points. It has the advantage of being able to detect clusters of arbitrary shapes, rendering it useful in many applications. In this paper, we propose fast parallel algorithms for Density Peaks Clustering (DPC), a popular version of density-based clustering. Existing exact DPC algorithms suffer from low parallelism both in theory and in practice, which limits their application to large-scale data sets. Our most performant algorithm, which is based on priority search kd-trees, achieves O(log n log log n) span (parallel time complexity) for a data set of n points. Our algorithm is also work-efficient, achieving a work complexity matching the best existing sequential exact DPC algorithm. In addition, we present another DPC algorithm based on a Fenwick tree that makes fewer assumptions for its average-case complexity to hold. We provide optimized implementations of our algorithms and evaluate their performance via extensive experiments. On a 30-core machine with two-way hyperthreading, we find that our best algorithm achieves a 10.8-13169x speedup over the previous best parallel exact DPC algorithm. Compared to the state-of-the-art parallel approximate DPC algorithm, our best algorithm achieves a 1.5-4206x speedup, while being exact.
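For readers unfamiliar with DPC, the following is a minimal sequential sketch of the standard brute-force formulation, which takes quadratic time. It is not the parallel algorithm described in the abstract; it only illustrates the three quantities an exact DPC algorithm must compute: each point's density, its nearest neighbor of higher density (its "dependent point"), and the resulting cluster assignment. The function name `dpc_bruteforce` and the parameters `d_cut`, `rho_min`, and `delta_min` are illustrative assumptions and do not appear in the abstract.

```python
import math


def dpc_bruteforce(points, d_cut, rho_min, delta_min):
    """Quadratic-time reference DPC (illustrative baseline, not the paper's method).

    Returns one label per point: the index of its cluster center, or -1 for noise.
    """
    n = len(points)
    # Density of a point: number of points within distance d_cut (itself included).
    rho = [sum(1 for q in points if math.dist(p, q) <= d_cut) for p in points]
    # Rank points by decreasing density, breaking ties by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    rank = {i: r for r, i in enumerate(order)}

    # Dependent point: the nearest point that ranks higher (is denser).
    dep = [-1] * n
    delta = [math.inf] * n  # distance to the dependent point
    for i in range(n):
        for j in range(n):
            if rank[j] < rank[i]:
                d = math.dist(points[i], points[j])
                if d < delta[i]:
                    delta[i], dep[i] = d, j

    labels = [-1] * n
    for i in order:  # denser points first, so dep[i] is labeled before i
        if rho[i] < rho_min:
            continue  # too sparse: treat as noise
        if delta[i] >= delta_min:
            labels[i] = i  # far from any denser point: a cluster center
        else:
            labels[i] = labels[dep[i]]  # inherit the dependent point's cluster
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (50, 50)]
    print(dpc_bruteforce(pts, d_cut=1.5, rho_min=2, delta_min=5))
```

The nested loop that searches for dependent points is the part that dominates the running time of this baseline, and it is the kind of query that a spatial index such as a kd-tree is typically used to accelerate.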
