Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters*

Q. Zhou, C. Chu, N. S. Kumar, Pouya Kousha, S. M. Ghazimirsaeed, H. Subramoni, D. Panda

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021. DOI: 10.1109/IPDPS49936.2021.00053
While the memory bandwidth of accelerators such as GPUs has significantly improved over the last decade, commodity networks such as Ethernet and InfiniBand are lagging in terms of raw throughput. Although there have been significant research efforts to improve large-message transfers of GPU-resident data, inter-node communication remains the major performance bottleneck due to the data explosion created by emerging High-Performance Computing (HPC) applications. On the other hand, recent developments in GPU-based compression algorithms exemplify the potential of high-performance message compression techniques to reduce the volume of data transferred, thereby reducing the load on an already overloaded inter-node communication fabric. Existing GPU-based compression schemes are not designed for “on-the-fly” execution and lead to severe performance degradation when integrated into communication libraries. In this paper, we take up this challenge and redesign the MVAPICH2 MPI library to enable high-performance, on-the-fly message compression for modern, dense GPU clusters. We also enhance existing implementations of the lossless and lossy compression algorithms, MPC and ZFP, to provide high-performance, on-the-fly message compression and decompression. We demonstrate that our proposed designs offer significant benefits at the microbenchmark and application levels. The proposed design provides up to 19% and 37% improvement in the GPU computing flops of AWP-ODC with the enhanced MPC-OPT and ZFP-OPT schemes, respectively. Moreover, we gain up to 1.56x improvement in Dask throughput. To the best of our knowledge, this is the first work that leverages GPU-based compression techniques to significantly improve GPU communication performance for various MPI primitives, MPI-based data science, and HPC applications.
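The compression described in the abstract happens transparently inside MVAPICH2, whose internal pipeline is not a public API. The sketch below therefore only illustrates the underlying idea at the application level: compressing a buffer with ZFP's fixed-accuracy (lossy) mode before MPI_Send and decompressing it after MPI_Recv, so that only the compressed bytes cross the network. It uses the standard public zfp C API; the element count N, error tolerance, tags, and file name are illustrative assumptions, and a real on-the-fly design would overlap (de)compression with RDMA transfers rather than serialize them as done here.

```c
/*
 * Illustrative sketch (not the MVAPICH2-internal design): compress a
 * double array with ZFP fixed-accuracy mode, send only the compressed
 * bytes over MPI, and decompress on the receiver.
 * Build: mpicc zfp_sendrecv.c -lzfp -o zfp_sendrecv
 * Run:   mpirun -np 2 ./zfp_sendrecv
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include "zfp.h"

#define N 1048576            /* element count (assumption) */
#define TOLERANCE 1e-6       /* ZFP absolute error bound (assumption) */

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double *data = malloc(N * sizeof(double));
  for (size_t i = 0; i < N; i++)          /* smooth data compresses well */
    data[i] = (rank == 0) ? 1.0 / (1.0 + i) : 0.0;

  /* Standard ZFP setup: describe the array, pick fixed-accuracy mode. */
  zfp_field *field = zfp_field_1d(data, zfp_type_double, N);
  zfp_stream *zfp = zfp_stream_open(NULL);
  zfp_stream_set_accuracy(zfp, TOLERANCE);
  /* In zfp >= 0.5.4, zfp_stream_set_execution(zfp, zfp_exec_cuda) would
   * select the CUDA backend for GPU-resident data; omitted for portability. */

  size_t max_bytes = zfp_stream_maximum_size(zfp, field);
  void *cbuf = malloc(max_bytes);
  bitstream *bs = stream_open(cbuf, max_bytes);
  zfp_stream_set_bit_stream(zfp, bs);

  if (rank == 0) {
    zfp_stream_rewind(zfp);
    size_t nbytes = zfp_compress(zfp, field);   /* returns 0 on failure */
    /* Two-step exchange: compressed size first, then only the payload.
     * (size_t is assumed to match MPI_UNSIGNED_LONG here.) */
    MPI_Send(&nbytes, 1, MPI_UNSIGNED_LONG, 1, 0, MPI_COMM_WORLD);
    MPI_Send(cbuf, (int)nbytes, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
    printf("sent %zu of %zu bytes (%.1fx reduction)\n", nbytes,
           N * sizeof(double), N * sizeof(double) / (double)nbytes);
  } else if (rank == 1) {
    size_t nbytes;
    MPI_Recv(&nbytes, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Recv(cbuf, (int)nbytes, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    zfp_stream_rewind(zfp);
    if (!zfp_decompress(zfp, field))            /* fills `data` in place */
      fprintf(stderr, "zfp_decompress failed\n");
  }

  stream_close(bs);
  zfp_stream_close(zfp);
  zfp_field_free(field);
  free(cbuf);
  free(data);
  MPI_Finalize();
  return 0;
}
```

The two-message exchange (size, then payload) mirrors the handshake a rendezvous protocol performs internally; the paper's contribution is moving this logic inside the MPI library so that applications get the bandwidth savings without any source changes.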