Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems

Chen-Chun Chen, Kawthar Shafie Khorassani, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda

2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2022. DOI: 10.1109/IPDPSW55747.2022.00014
In recent years, High Performance Computing (HPC) and Deep Learning (DL) applications have been adapted to run on leading supercomputers and exploit the high compute power of GPUs. While GPUs provide high computational power, communication of data between GPUs and across the network remains a bottleneck. In particular, with the growing amount of FFT computation and sparse matrix transpose operations in these applications, Alltoall MPI collective operations are heavily used. Alltoall is considered the heaviest communication pattern among MPI collective calls. Few techniques and algorithms effectively optimize Alltoall communication, much less improve its performance on dense GPU clusters while exploiting the features of modern interconnects and topologies. Although NVIDIA introduced Inter-Process Communication (IPC) in CUDA 4.1, state-of-the-art MPI libraries have not utilized these IPC-based mechanisms to design novel Alltoall algorithms that exploit the capabilities of modern GPUs. In this paper, we propose hybrid IPC-advanced designs for Alltoall and Alltoallv communication on modern GPU systems. By utilizing zero-copy load-store IPC mechanisms for multi-GPU communication within a node, we are able to overlap intra-node and inter-node communication, yielding improved performance on GPU systems. We evaluate the benefits of our designs at the benchmark and application layers on the ThetaGPU system at ALCF and the Lassen system at LLNL. At the benchmark level, our designs provide up to 13.5x and 71% improvements over state-of-the-art MPI libraries on 128 GPUs on ThetaGPU and 64 GPUs on Lassen, respectively. At the application level, our designs deliver up to a 59x performance improvement for an HPC application, heFFTe, and a 5.7x improvement for a Deep Learning application, DeepSpeed, on 64 GPUs on ThetaGPU and 256 GPUs on Lassen.
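The mechanism the abstract builds on is CUDA IPC: a rank exports a handle to its GPU buffer, peer ranks on the same node map that handle into their own address space, and data moves GPU-to-GPU without staging through host memory. The sketch below is not the authors' design; it is a minimal illustration of that intra-node building block, assuming one MPI rank per GPU on a single node, a fixed message size CHUNK, and handle exchange via MPI_Allgather. The paper's hybrid designs additionally overlap this intra-node IPC traffic with inter-node MPI transfers, which is omitted here.

```c
/* Hedged sketch of the CUDA IPC building block (not the paper's code).
 * Assumes: all ranks on one node, one GPU per rank, fixed CHUNK size. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* bytes each rank contributes (assumed size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Select this rank's GPU and allocate its send/receive buffers. */
    cudaSetDevice(rank);              /* assumes ranks map 1:1 to local GPUs */
    char *sendbuf, *recvbuf;
    cudaMalloc((void **)&sendbuf, CHUNK);
    cudaMalloc((void **)&recvbuf, (size_t)size * CHUNK);
    cudaMemset(sendbuf, rank, CHUNK);

    /* Export an IPC handle for the local buffer and gather everyone's. */
    cudaIpcMemHandle_t my_handle;
    cudaIpcGetMemHandle(&my_handle, sendbuf);
    cudaIpcMemHandle_t *handles = malloc(size * sizeof(cudaIpcMemHandle_t));
    MPI_Allgather(&my_handle, sizeof(my_handle), MPI_BYTE,
                  handles, sizeof(my_handle), MPI_BYTE, MPI_COMM_WORLD);

    /* Pull each peer's contribution straight from its GPU memory.
     * A load-store kernel over the mapped pointer would avoid even this
     * copy call; cudaMemcpy with peer access is used here for brevity. */
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) {
            cudaMemcpy(recvbuf + (size_t)peer * CHUNK, sendbuf, CHUNK,
                       cudaMemcpyDeviceToDevice);
            continue;
        }
        void *peer_ptr;
        cudaIpcOpenMemHandle(&peer_ptr, handles[peer],
                             cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(recvbuf + (size_t)peer * CHUNK, peer_ptr, CHUNK,
                   cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(peer_ptr);
    }
    cudaDeviceSynchronize();

    /* Keep buffers alive until every rank has finished reading them. */
    MPI_Barrier(MPI_COMM_WORLD);
    cudaFree(sendbuf);
    cudaFree(recvbuf);
    free(handles);
    MPI_Finalize();
    return 0;
}
```

In a multi-node Alltoall, only the node-local portion of the exchange can use this zero-copy path; the remaining blocks still travel over the network through MPI, which is where the overlap of intra-node and inter-node phases described in the abstract comes in.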