A. Singh, S. Potluri, Hao Wang, K. Kandalla, S. Sur, D. Panda
{"title":"MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefit","authors":"A. Singh, S. Potluri, Hao Wang, K. Kandalla, S. Sur, D. Panda","doi":"10.1109/CLUSTER.2011.67","DOIUrl":null,"url":null,"abstract":"General Purpose Graphics Processing Units (GPGPUs) are rapidly becoming an integral part of high performance system architectures. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. Increasingly many scientific applications that were originally written for CPUs using MPI for parallelism are being ported to these hybrid CPU-GPU clusters. In the traditional sense, CPUs perform computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, the data movement to and from GPGPUs is both a performance and productivity bottleneck. Recently, the MVAPICH2 MPI library has been modified to directly support point-to-point MPI communication from the GPU memory [1]. Using this support, programmers do not need to explicitly move data to main memory before using MPI. This feature also enables performance improvement due to tight integration of GPU data movement and MPI internal protocols. Typically, scientific applications spend a significant portion of their execution time in collective communication. Hence, optimizing performance of collectives has a significant impact on their performance. MPI_Alltoall is a heavily used collective that has O(N2) communication, for N processes. In this paper, we outline the major design alternatives for MPI_Alltoall collective communication operation on GPGPU clusters. We propose three design alternatives and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI_Alltoall on GPU clusters can be improved by 44% over a user level implementation and 31% over a send-recv based implementation for 256 KByte messages on 8 processes.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2011.67","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 21
Abstract
General Purpose Graphics Processing Units (GPGPUs) are rapidly becoming an integral part of high-performance system architectures. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. Increasingly, scientific applications originally written for CPUs using MPI for parallelism are being ported to these hybrid CPU-GPU clusters. Traditionally, CPUs perform computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, data movement to and from GPGPUs is both a performance and a productivity bottleneck. Recently, the MVAPICH2 MPI library was modified to directly support point-to-point MPI communication from GPU memory [1]. With this support, programmers do not need to explicitly move data to main memory before using MPI. The feature also improves performance through tight integration of GPU data movement with MPI internal protocols. Scientific applications typically spend a significant portion of their execution time in collective communication, so optimizing collectives can significantly improve overall application performance. MPI_Alltoall is a heavily used collective that requires O(N²) communication for N processes. In this paper, we outline the major design alternatives for the MPI_Alltoall collective communication operation on GPGPU clusters. We propose three design alternatives and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI_Alltoall on GPU clusters can be improved by 44% over a user-level implementation and by 31% over a send-recv based implementation for 256 KByte messages on 8 processes.
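To illustrate the contrast the abstract draws between a user-level implementation (explicit device-to-host staging around MPI calls) and MPI communication directly from GPU memory, the sketch below shows both paths for a 256 KByte-per-peer exchange. It is a minimal illustration, not the paper's dynamic staging design: it assumes an MPI library built with CUDA support (such as MVAPICH2 with GPU buffers enabled), and the buffer names, sizes, and omitted error handling are illustrative.

```c
/* Sketch: explicit host staging vs. GPU-aware MPI_Alltoall.
 * Assumes a CUDA-aware MPI build; names and sizes are illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const size_t msg_bytes = 256 * 1024;           /* 256 KByte per peer */
    const size_t total     = msg_bytes * (size_t)nprocs;

    char *d_send, *d_recv;                         /* device-resident buffers */
    cudaMalloc((void **)&d_send, total);
    cudaMalloc((void **)&d_recv, total);

    /* (a) User-level staging: copy device -> host, exchange on the host,
     *     then copy the result back to the device. */
    char *h_send = (char *)malloc(total);
    char *h_recv = (char *)malloc(total);
    cudaMemcpy(h_send, d_send, total, cudaMemcpyDeviceToHost);
    MPI_Alltoall(h_send, (int)msg_bytes, MPI_BYTE,
                 h_recv, (int)msg_bytes, MPI_BYTE, MPI_COMM_WORLD);
    cudaMemcpy(d_recv, h_recv, total, cudaMemcpyHostToDevice);

    /* (b) GPU-aware path: pass device pointers directly; the MPI library
     *     handles the device <-> host movement internally, which is what
     *     lets it overlap staging with the collective's communication. */
    MPI_Alltoall(d_send, (int)msg_bytes, MPI_BYTE,
                 d_recv, (int)msg_bytes, MPI_BYTE, MPI_COMM_WORLD);

    free(h_send);
    free(h_recv);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

In path (a) the full buffer must reach host memory before any network transfer begins, whereas in path (b) the library can interleave device-to-host copies with sends to individual peers; this overlap is the kind of opportunity the paper's staging-based designs target.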