{"title":"GPU上的Hyper-Q感知内部网MPI集合","authors":"Iman Faraji, A. Afsahi","doi":"10.1145/2832241.2832247","DOIUrl":null,"url":null,"abstract":"In GPU clusters, high GPU utilization and efficient communication play an important role in the performance of the MPI applications. To improve the GPU utilization, NVIDIA has introduced the Multi Process Service (MPS), eliminating the context-switching overhead among processes accessing the GPU and allowing multiple intranode processes to further overlap their CUDA tasks on the GPU and potentially share its resources through the Hyper-Q feature. Prior to MPS, Hyper-Q could only provide such resource sharing within a single process. In this paper, we evaluate the effect of the MPS service on the GPU communications with the focus on CUDA IPC and host-staged copies. We provide evidence that utilizing the MPS service is beneficial on multiple interprocess communications using these copy types. However, we show that efficient design decisions are required to further harness the potential of this service. To this aim, we propose a Static algorithm and Dynamic algorithm that can be applied to various intranode MPI collective operations, and as a test case we provide the results for the MPI_Allreduce operation. Both approaches, while following different algorithms, use a combination of the host-staged and CUDA IPC copies for the interprocess communications of their collective designs. By selecting the right number and type of the copies, our algorithms are capable of efficiently leveraging the MPS and Hyper-Q feature and provide improvement over MVAPICH2 and MVAPICH2-GDR for most of the medium and all of the large messages. Our results suggest that the Dynamic algorithm is comparable with the Static algorithm, while is independent of any tuning table and thus can be portable across platforms.","PeriodicalId":347945,"journal":{"name":"ESPM '15","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Hyper-Q aware intranode MPI collectives on the GPU\",\"authors\":\"Iman Faraji, A. Afsahi\",\"doi\":\"10.1145/2832241.2832247\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In GPU clusters, high GPU utilization and efficient communication play an important role in the performance of the MPI applications. To improve the GPU utilization, NVIDIA has introduced the Multi Process Service (MPS), eliminating the context-switching overhead among processes accessing the GPU and allowing multiple intranode processes to further overlap their CUDA tasks on the GPU and potentially share its resources through the Hyper-Q feature. Prior to MPS, Hyper-Q could only provide such resource sharing within a single process. In this paper, we evaluate the effect of the MPS service on the GPU communications with the focus on CUDA IPC and host-staged copies. We provide evidence that utilizing the MPS service is beneficial on multiple interprocess communications using these copy types. However, we show that efficient design decisions are required to further harness the potential of this service. To this aim, we propose a Static algorithm and Dynamic algorithm that can be applied to various intranode MPI collective operations, and as a test case we provide the results for the MPI_Allreduce operation. 
Both approaches, while following different algorithms, use a combination of the host-staged and CUDA IPC copies for the interprocess communications of their collective designs. By selecting the right number and type of the copies, our algorithms are capable of efficiently leveraging the MPS and Hyper-Q feature and provide improvement over MVAPICH2 and MVAPICH2-GDR for most of the medium and all of the large messages. Our results suggest that the Dynamic algorithm is comparable with the Static algorithm, while is independent of any tuning table and thus can be portable across platforms.\",\"PeriodicalId\":347945,\"journal\":{\"name\":\"ESPM '15\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-11-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ESPM '15\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2832241.2832247\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ESPM '15","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2832241.2832247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hyper-Q aware intranode MPI collectives on the GPU
In GPU clusters, high GPU utilization and efficient communication play an important role in the performance of MPI applications. To improve GPU utilization, NVIDIA has introduced the Multi-Process Service (MPS), which eliminates the context-switching overhead among processes accessing the GPU and allows multiple intranode processes to further overlap their CUDA tasks on the GPU and potentially share its resources through the Hyper-Q feature. Prior to MPS, Hyper-Q could only provide such resource sharing within a single process. In this paper, we evaluate the effect of the MPS service on GPU communication, focusing on CUDA IPC and host-staged copies. We provide evidence that utilizing the MPS service is beneficial when multiple interprocess communications use these copy types. However, we show that efficient design decisions are required to further harness the potential of this service. To this end, we propose a Static and a Dynamic algorithm that can be applied to various intranode MPI collective operations, and as a test case we provide results for the MPI_Allreduce operation. Both approaches, while following different algorithms, use a combination of host-staged and CUDA IPC copies for the interprocess communication in their collective designs. By selecting the right number and type of copies, our algorithms efficiently leverage MPS and the Hyper-Q feature, and provide improvements over MVAPICH2 and MVAPICH2-GDR for most medium and all large messages. Our results suggest that the Dynamic algorithm is comparable to the Static algorithm while being independent of any tuning table, and is thus portable across platforms.
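As a concrete illustration of the two copy types the abstract compares, the sketch below moves a device buffer from MPI rank 0 to rank 1 on the same node and GPU, once via CUDA IPC (rank 1 maps rank 0's device allocation through an IPC handle and performs a device-to-device copy) and once via host staging (device-to-host copy, host-side MPI transfer, host-to-device copy). This is a minimal, hedged example of the underlying mechanisms, not the Static or Dynamic algorithm from the paper; the buffer size N, the message tags, and the two-rank setup are assumptions made for illustration. Run it with two intranode ranks, e.g. mpirun -np 2.

/*
 * Minimal sketch (not the authors' implementation) contrasting the two
 * intranode GPU copy types: CUDA IPC vs. host-staged.
 * Assumes exactly 2 MPI ranks sharing one GPU; error checking omitted.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)   /* message size in floats (assumed for illustration) */

int main(int argc, char **argv)
{
    int rank;
    float *dbuf;                       /* device buffer owned by each rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&dbuf, N * sizeof(float));

    /* --- CUDA IPC copy: rank 1 maps rank 0's device buffer directly --- */
    if (rank == 0) {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, dbuf);           /* export IPC handle */
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        cudaIpcMemHandle_t handle;
        void *peer_ptr;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaIpcOpenMemHandle(&peer_ptr, handle,       /* map peer memory */
                             cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(dbuf, peer_ptr, N * sizeof(float),
                   cudaMemcpyDeviceToDevice);         /* direct D2D copy */
        cudaIpcCloseMemHandle(peer_ptr);
    }

    /* --- Host-staged copy: device -> host -> MPI -> host -> device --- */
    float *hbuf = (float *)malloc(N * sizeof(float));
    if (rank == 0) {
        cudaMemcpy(hbuf, dbuf, N * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(hbuf, N, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(hbuf, N, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(dbuf, hbuf, N * sizeof(float), cudaMemcpyHostToDevice);
    }

    free(hbuf);
    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}

With MPS enabled, the CUDA work issued by both ranks is funneled through a shared server process, so copies like the ones above can overlap on the GPU through Hyper-Q rather than serializing on context switches, which is the behavior the paper's collective designs aim to exploit.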