Estimating the Impact of Communication Schemes for Distributed Graph Processing

2022 21st International Symposium on Parallel and Distributed Computing (ISPDC), Pub Date: 2022-07-01, DOI: 10.1109/ISPDC55340.2022.00016

Tian Ye, S. Kuppannagari, C. Rose, Sasindu Wijeratne, R. Kannan, V. Prasanna


Citations: 0

Abstract

Extreme-scale graph analytics is imperative for several real-world Big Data applications whose underlying graphs contain millions or billions of vertices and edges. Because such graphs cannot fit into the memory of a single computer, they must be processed in a distributed fashion. Several frameworks have been developed for graph processing on distributed systems. These frameworks focus primarily on choosing the right computation model and partitioning scheme, under the assumption that such design choices will automatically reduce communication overheads. Yet for any computation model and partitioning scheme, the communication scheme (the data to be communicated and the virtual interconnection network among the nodes) has a significant impact on performance. To analyze this impact, we identify widely used communication schemes and estimate their performance. Analyzing the trade-offs between the number of compute nodes and the communication costs of various schemes on a distributed platform by brute-force experimentation can be prohibitively expensive. Our performance estimation models therefore provide an economical way to perform these analyses, taking the partitions and the communication scheme as input. We validate our model on a local HPC cluster as well as on the cloud-hosted NSF Chameleon cluster. Using our estimates together with the actual measurements, we compare the communication schemes and provide conditions under which one scheme should be preferred over the others.
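To make the idea of estimating communication cost from a partition concrete, the following is a minimal sketch and not the authors' model: it assumes a direct point-to-point scheme in which each cut edge carries one fixed-size message per superstep, and it uses the generic latency-bandwidth (alpha-beta) cost model. The function names, the parameters `alpha`, `beta`, and `bytes_per_msg`, and the assumption that each node sends sequentially are all illustrative choices not taken from the paper.

```python
from collections import defaultdict


def cut_edge_traffic(edges, part):
    """Count messages per ordered pair of partitions.

    edges: iterable of directed edges (u, v)
    part:  dict mapping vertex -> partition (compute node) id
    Assumption: one message per cut edge, sent from u's node to v's node.
    """
    traffic = defaultdict(int)
    for u, v in edges:
        src, dst = part[u], part[v]
        if src != dst:  # edges inside a partition need no communication
            traffic[(src, dst)] += 1
    return dict(traffic)


def estimate_superstep_time(traffic, alpha, beta, bytes_per_msg):
    """Alpha-beta estimate of one communication round.

    alpha: per-destination startup latency (seconds), hypothetical value
    beta:  transfer time per byte (seconds), hypothetical value
    Assumption: a node sends to its destinations one after another, and the
    slowest sender determines the duration of the round.
    """
    per_sender = defaultdict(float)
    for (src, _dst), count in traffic.items():
        per_sender[src] += alpha + beta * count * bytes_per_msg
    return max(per_sender.values(), default=0.0)


# Toy graph with four vertices split across two compute nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
traffic = cut_edge_traffic(edges, part)
estimate = estimate_superstep_time(traffic, alpha=1.0, beta=0.5, bytes_per_msg=8)
```

Changing `part` or swapping in a different rule for who sends what to whom gives a cheap way to compare alternatives before running them on a cluster, which is the kind of analysis the abstract motivates; the real models in the paper may account for effects this sketch ignores.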



