Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Hao Wang, S. Potluri, Miao Luo, A. Singh, Xiangyong Ouyang, S. Sur, D. Panda
{"title":"Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2","authors":"Hao Wang, S. Potluri, Miao Luo, A. Singh, Xiangyong Ouyang, S. Sur, D. Panda","doi":"10.1109/CLUSTER.2011.42","DOIUrl":null,"url":null,"abstract":"Data parallel architectures, such as General Purpose Graphics Units (GPGPUs) have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Real scientific applications utilize multi-dimensional data. Data in higher dimensions may not be contiguous in memory. In order to improve programmer productivity and to enable communication libraries to optimize non-contiguous data communication, the MPI interface provides MPI data types. Currently, state of the art MPI libraries do not provide native data type support for data that resides in GPU memory. The management of non-contiguous GPU data is a source of productivity and performance loss, because GPU application developers have to manually move the data out of and in to GPUs. In this paper, we present our design for enabling high-performance communication support between GPUs for non-contiguous data types. We describe our innovative approach to improve performance by \"offloading\" data type packing and unpacking on to a GPU device, and \"pipelining\" all data transfer stages between two GPUs. Our design is integrated into the popular MVAPICH2 MPI library for InfiniBand, iWARP and RoCE clusters. We perform a detailed evaluation of our design on a GPU cluster with the latest NVIDIA Fermi GPU adapters. The evaluation reveals that the proposed designs can achieve up to 88% latency improvement for vector data type at 4 MB size with micro benchmarks. For Stencil2D application from the SHOC benchmark suite, our design can simplify the data communication in its main loop, reducing the lines of code by 36%. Further, our method can improve the performance of Stencil2D by up to 42% for single precision data set, and 39% for double precision data set. To the best of our knowledge, this is the first such design, implementation and evaluation of non-contiguous MPI data communication for GPU clusters.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"170 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"55","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2011.42","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 55

Abstract

Data-parallel architectures, such as General-Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application to High End Computing. However, data movement into and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Real scientific applications utilize multi-dimensional data, and data in higher dimensions may not be contiguous in memory. To improve programmer productivity and to enable communication libraries to optimize non-contiguous data communication, the MPI interface provides MPI datatypes. Currently, state-of-the-art MPI libraries do not provide native datatype support for data that resides in GPU memory. The management of non-contiguous GPU data is a source of productivity and performance loss, because GPU application developers have to manually move the data out of and into GPUs. In this paper, we present our design for enabling high-performance communication of non-contiguous datatypes between GPUs. We describe our innovative approach to improve performance by "offloading" datatype packing and unpacking onto the GPU device, and "pipelining" all data transfer stages between two GPUs. Our design is integrated into the popular MVAPICH2 MPI library for InfiniBand, iWARP and RoCE clusters. We perform a detailed evaluation of our design on a GPU cluster with the latest NVIDIA Fermi GPUs. The evaluation reveals that the proposed designs can achieve up to 88% latency improvement for the vector datatype at a 4 MB message size in micro-benchmarks. For the Stencil2D application from the SHOC benchmark suite, our design simplifies the data communication in its main loop, reducing the lines of code by 36%. Further, our method improves the performance of Stencil2D by up to 42% for the single-precision data set and 39% for the double-precision data set. To the best of our knowledge, this is the first such design, implementation and evaluation of non-contiguous MPI data communication for GPU clusters.
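
To make the kind of communication the paper targets concrete, the sketch below exchanges one non-contiguous column of a 2D grid residing in GPU memory, described with a standard MPI vector datatype and passed to MPI directly from a device pointer. This mirrors the halo exchange of a Stencil2D-style code, but it is only a minimal sketch under the assumption of an MPI library with GPU datatype support (such as the MVAPICH2 design evaluated in the paper); the grid dimensions, variable names, and the simple 1D ring exchange are illustrative and not taken from the paper.

/* Hypothetical sketch: exchange a non-contiguous column of a 2D grid that
 * lives in GPU memory, using a standard MPI vector datatype. With datatype
 * support for device buffers (as in the MVAPICH2 design described above),
 * the device pointer can be handed directly to MPI_Sendrecv; the library
 * packs/unpacks on the GPU and pipelines the device-to-device transfer.
 * Grid sizes, variable names, and the ring neighbors are illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 1024               /* rows    (illustrative) */
#define NY 1024               /* columns (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 2D grid stored row-major in GPU global memory */
    double *d_grid;
    cudaMalloc((void **)&d_grid, (size_t)NX * NY * sizeof(double));
    cudaMemset(d_grid, 0, (size_t)NX * NY * sizeof(double));

    /* One column of the grid: NX blocks of 1 double, stride of NY doubles */
    MPI_Datatype column_t;
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* Send the rightmost column to the right neighbor, receive the left
     * neighbor's rightmost column into our leftmost column (1D ring;
     * real stencil codes use a 2D decomposition). */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Sendrecv(d_grid + (NY - 1), 1, column_t, right, 0,  /* send last column  */
                 d_grid,            1, column_t, left,  0,  /* recv first column */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column_t);
    cudaFree(d_grid);
    MPI_Finalize();
    return 0;
}

The point of the design described in the abstract is that the MPI_Type_vector and MPI_Sendrecv calls above stay unchanged: rather than the application manually packing the strided column and staging it through host memory with cudaMemcpy, the library performs packing and unpacking on the GPU and overlaps the transfer stages between the two devices.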