多gpu集群上巨大图的可伸缩全对最短路径

Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2020-06-21 DOI:10.1145/3431379.3460651

Piyush Sao, Hao Lu, R. Kannan, Vijay Thakkar, R. Vuduc, T. Potok

{"title":"多gpu集群上巨大图的可伸缩全对最短路径","authors":"Piyush Sao, Hao Lu, R. Kannan, Vijay Thakkar, R. Vuduc, T. Potok","doi":"10.1145/3431379.3460651","DOIUrl":null,"url":null,"abstract":"We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matrix-multiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPI-collective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU on-demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256~nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"53 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters\",\"authors\":\"Piyush Sao, Hao Lu, R. Kannan, Vijay Thakkar, R. Vuduc, T. Potok\",\"doi\":\"10.1145/3431379.3460651\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matrix-multiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPI-collective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU on-demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256~nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.\",\"PeriodicalId\":343991,\"journal\":{\"name\":\"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing\",\"volume\":\"53 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3431379.3460651\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3431379.3460651","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

我们提出了一种优化的Floyd-Warshall (Floyd-Warshall)算法，用于计算GPU加速集群的全对最短路径(APSP)。Floyd-Warshall算法由于其与矩阵乘法的结构相似性，非常适合于高度并行的GPU架构。为了实现高并行效率，我们解决了两个关键的算法挑战:减少高通信开销和解决有限的GPU内存。为了降低高昂的通信成本，我们重新设计了并行(a)以暴露更多的并行性，(b)积极地将通信和计算与流水线和异步调度操作重叠，以及(c)定制mpi集合。为了应对有限的GPU内存，我们采用了一种卸载模型，其中数据驻留在主机上并按需传输到GPU。建议的优化得到了用于调优的详细性能模型的支持。我们优化的并行Floyd-Warshall实现比强基线快5倍，在橡树岭国家实验室Summit超级计算机的256个节点上达到8.1 PetaFLOPS/sec。该性能代表理论峰值的70%和并行效率的80%。卸载算法可以处理2.5倍大的图，总体运行时间增加20%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters

We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matrix-multiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPI-collective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU on-demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256~nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing

自引率

0.00%

发文量