Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters*

Q. Zhou, C. Chu, N. S. Kumar, Pouya Kousha, S. M. Ghazimirsaeed, H. Subramoni, D. Panda

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021. DOI: 10.1109/IPDPS49936.2021.00053
While the memory bandwidth of accelerators such as GPUs has significantly improved over the last decade, commodity networks such as Ethernet and InfiniBand are lagging in terms of raw throughput. Although there have been significant research efforts to improve large-message transfers of GPU-resident data, inter-node communication remains the major performance bottleneck due to the data explosion created by emerging High-Performance Computing (HPC) applications. On the other hand, recent developments in GPU-based compression algorithms exemplify the potential of high-performance message compression techniques to reduce the volume of data transferred, thereby reducing the load on an already overloaded inter-node communication fabric. Existing GPU-based compression schemes are not designed for “on-the-fly” execution and lead to severe performance degradation when integrated into communication libraries. In this paper, we take up this challenge and redesign the MVAPICH2 MPI library to enable high-performance, on-the-fly message compression for modern, dense GPU clusters. We also enhance existing implementations of the lossless and lossy compression algorithms, MPC and ZFP, to provide high-performance, on-the-fly message compression and decompression. We demonstrate that our proposed designs offer significant benefits at the microbenchmark and application levels. The proposed design provides up to 19% and 37% improvement in the GPU computing flops of AWP-ODC with the enhanced MPC-OPT and ZFP-OPT schemes, respectively. Moreover, we gain up to 1.56x improvement in Dask throughput. To the best of our knowledge, this is the first work that leverages GPU-based compression techniques to significantly improve GPU communication performance for various MPI primitives, MPI-based data science, and HPC applications.
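The compression described in the abstract happens transparently inside MVAPICH2, whose internal pipeline is not a public API. The sketch below therefore only illustrates the underlying idea at the application level: compressing a buffer with ZFP's fixed-accuracy (lossy) mode before MPI_Send and decompressing it after MPI_Recv, so that only the compressed bytes cross the network. It uses the standard public zfp C API; the element count N, error tolerance, tags, and file name are illustrative assumptions, and a real on-the-fly design would overlap (de)compression with RDMA transfers rather than serialize them as done here.

```c
/*
 * Illustrative sketch (not the MVAPICH2-internal design): compress a
 * double array with ZFP fixed-accuracy mode, send only the compressed
 * bytes over MPI, and decompress on the receiver.
 * Build: mpicc zfp_sendrecv.c -lzfp -o zfp_sendrecv
 * Run:   mpirun -np 2 ./zfp_sendrecv
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include "zfp.h"

#define N 1048576            /* element count (assumption) */
#define TOLERANCE 1e-6       /* ZFP absolute error bound (assumption) */

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double *data = malloc(N * sizeof(double));
  for (size_t i = 0; i < N; i++)          /* smooth data compresses well */
    data[i] = (rank == 0) ? 1.0 / (1.0 + i) : 0.0;

  /* Standard ZFP setup: describe the array, pick fixed-accuracy mode. */
  zfp_field *field = zfp_field_1d(data, zfp_type_double, N);
  zfp_stream *zfp = zfp_stream_open(NULL);
  zfp_stream_set_accuracy(zfp, TOLERANCE);
  /* In zfp >= 0.5.4, zfp_stream_set_execution(zfp, zfp_exec_cuda) would
   * select the CUDA backend for GPU-resident data; omitted for portability. */

  size_t max_bytes = zfp_stream_maximum_size(zfp, field);
  void *cbuf = malloc(max_bytes);
  bitstream *bs = stream_open(cbuf, max_bytes);
  zfp_stream_set_bit_stream(zfp, bs);

  if (rank == 0) {
    zfp_stream_rewind(zfp);
    size_t nbytes = zfp_compress(zfp, field);   /* returns 0 on failure */
    /* Two-step exchange: compressed size first, then only the payload.
     * (size_t is assumed to match MPI_UNSIGNED_LONG here.) */
    MPI_Send(&nbytes, 1, MPI_UNSIGNED_LONG, 1, 0, MPI_COMM_WORLD);
    MPI_Send(cbuf, (int)nbytes, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
    printf("sent %zu of %zu bytes (%.1fx reduction)\n", nbytes,
           N * sizeof(double), N * sizeof(double) / (double)nbytes);
  } else if (rank == 1) {
    size_t nbytes;
    MPI_Recv(&nbytes, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Recv(cbuf, (int)nbytes, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    zfp_stream_rewind(zfp);
    if (!zfp_decompress(zfp, field))            /* fills `data` in place */
      fprintf(stderr, "zfp_decompress failed\n");
  }

  stream_close(bs);
  zfp_stream_close(zfp);
  zfp_field_free(field);
  free(cbuf);
  free(data);
  MPI_Finalize();
  return 0;
}
```

The two-message exchange (size, then payload) mirrors the handshake a rendezvous protocol performs internally; the paper's contribution is moving this logic inside the MPI library so that applications get the bandwidth savings without any source changes.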