并行空间聚类算法数据分布策略调查与实验综述

IF 1.2 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Computer Science and Technology Pub Date : 2024-06-26 DOI:10.1007/s11390-024-2700-0

Jagat Sesh Challa, Navneet Goyal, Amogh Sharma, Nikhil Sreekumar, Sundar Balasubramaniam, Poonam Goyal

{"title":"并行空间聚类算法数据分布策略调查与实验综述","authors":"Jagat Sesh Challa, Navneet Goyal, Amogh Sharma, Nikhil Sreekumar, Sundar Balasubramaniam, Poonam Goyal","doi":"10.1007/s11390-024-2700-0","DOIUrl":null,"url":null,"abstract":"The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI, MapReduce, and Spark. An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes. This step governs the methodology and performance of the entire algorithm. Researchers typically use random, or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning, as per the requirements of the algorithm. However, these strategies are generic and are not tailor-made for any specific parallel clustering algorithm. In this paper, we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ. We also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS, Cell-Based Dimensional Split for dGridSLINK, which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution, and Projection-Based Split, which is a generic distribution strategy. All of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing. The experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for, based on which we give appropriate recommendations for their usage.","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"16 1","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms\",\"authors\":\"Jagat Sesh Challa, Navneet Goyal, Amogh Sharma, Nikhil Sreekumar, Sundar Balasubramaniam, Poonam Goyal\",\"doi\":\"10.1007/s11390-024-2700-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI, MapReduce, and Spark. An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes. This step governs the methodology and performance of the entire algorithm. Researchers typically use random, or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning, as per the requirements of the algorithm. However, these strategies are generic and are not tailor-made for any specific parallel clustering algorithm. In this paper, we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ. We also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS, Cell-Based Dimensional Split for dGridSLINK, which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution, and Projection-Based Split, which is a generic distribution strategy. All of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing. The experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for, based on which we give appropriate recommendations for their usage.\",\"PeriodicalId\":50222,\"journal\":{\"name\":\"Journal of Computer Science and Technology\",\"volume\":\"16 1\",\"pages\":\"\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2024-06-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computer Science and Technology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11390-024-2700-0\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Science and Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11390-024-2700-0","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

大数据的出现使通过分布式计算框架（如 MPI、MapReduce 和 Spark）工作的并行聚类算法的使用量迅速增长。任何并行聚类算法的一个重要步骤都是在集群节点之间分配数据。这一步决定了整个算法的方法和性能。研究人员通常根据算法的要求，使用随机或空间/几何分布策略，如基于 kd 树的分区和基于网格的分区。然而，这些策略都是通用的，并不是为任何特定的并行聚类算法量身定制的。在本文中，我们对基于 MPI 的并行聚类算法进行了非常全面的文献调查，并特别提到了这些算法所采用的特定数据分布策略。我们还提出了三种新的数据分布策略，即针对 DBSCAN 和 OPTICS 等基于密度的并行聚类算法的参数化维度拆分、针对 dGridSLINK 的基于单元格的维度拆分（dGridSLINK 是一种基于网格的分层聚类算法，在空间分布不连续时表现出高效性）以及基于投影的拆分（一种通用的分布策略）。所有这些都能保持空间位置性，实现不相交的分区，并确保良好的数据负载平衡。实验分析表明了针对所设计的算法使用所提出的数据分布策略的好处，在此基础上，我们给出了使用这些策略的适当建议。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms

The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI, MapReduce, and Spark. An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes. This step governs the methodology and performance of the entire algorithm. Researchers typically use random, or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning, as per the requirements of the algorithm. However, these strategies are generic and are not tailor-made for any specific parallel clustering algorithm. In this paper, we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ. We also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS, Cell-Based Dimensional Split for dGridSLINK, which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution, and Projection-Based Split, which is a generic distribution strategy. All of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing. The experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for, based on which we give appropriate recommendations for their usage.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Computer Science and Technology 工程技术-计算机：软件工程

CiteScore

4.00

自引率

0.00%

发文量

2255

审稿时长

9.8 months

期刊介绍： Journal of Computer Science and Technology (JCST), the first English language journal in the computer field published in China, is an international forum for scientists and engineers involved in all aspects of computer science and technology to publish high quality and refereed papers. Papers reporting original research and innovative applications from all parts of the world are welcome. Papers for publication in the journal are selected through rigorous peer review, to ensure originality, timeliness, relevance, and readability. While the journal emphasizes the publication of previously unpublished materials, selected conference papers with exceptional merit that require wider exposure are, at the discretion of the editors, also published, provided they meet the journal''s peer review standards. The journal also seeks clearly written survey and review articles from experts in the field, to promote insightful understanding of the state-of-the-art and technology trends. Topics covered by Journal of Computer Science and Technology include but are not limited to: -Computer Architecture and Systems -Artificial Intelligence and Pattern Recognition -Computer Networks and Distributed Computing -Computer Graphics and Multimedia -Software Systems -Data Management and Data Mining -Theory and Algorithms -Emerging Areas