Hybrid Communication with TCA and InfiniBand on a Parallel Programming Language XcalableACC for GPU Clusters

Tetsuya Odajima, T. Boku, T. Hanawa, H. Murai, M. Nakao, Akihiro Tabuchi, M. Sato
Published in: 2015 IEEE International Conference on Cluster Computing, 2015-09-08. DOI: 10.1109/CLUSTER.2015.112. Citations: 2.

Abstract

For the execution of parallel HPC applications on GPU-ready clusters, high inter-node communication latency between GPUs is a serious obstacle to strong scalability. To reduce this latency, we proposed the Tightly Coupled Accelerator (TCA) architecture and developed the PEACH2 board as a proof-of-concept interconnect for TCA. Although PEACH2 provides very low communication latency, its PCIe-based implementation imposes hardware limitations; in particular, the practical number of nodes in a system is currently 16, a unit we call a sub-cluster. Larger node counts must be connected by a conventional interconnect such as InfiniBand, so the overall network is configured as a hybrid: a global conventional network combined with local high-speed PEACH2 networks. For ease of programming, it is desirable to operate such a complicated communication system at the library or language level, which hides the network details from the user. In this paper, we develop a hybrid interconnection network combining PEACH2 and InfiniBand, and implement it in a high-level PGAS language for accelerated clusters named XcalableACC (XACC). A preliminary performance evaluation confirms that, on the Himeno stencil-computation benchmark, the hybrid network improves performance by up to 40% relative to MVAPICH2 with GDR on InfiniBand. Additionally, Allgather collective communication over the hybrid network improves performance by up to 50% on networks of 8 to 16 nodes. The combination of local communication, supported by the low latency of PEACH2, with global communication, supported by the high bandwidth and scalability of InfiniBand, improves overall performance.
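The core idea of the hybrid network — low-latency PEACH2 links inside a 16-node sub-cluster, InfiniBand between sub-clusters — can be illustrated with a toy link-selection rule. This is a minimal sketch based only on the abstract; the function name and rank layout are assumptions, not the paper's actual XACC runtime API.

```python
SUB_CLUSTER_SIZE = 16  # practical PEACH2 sub-cluster limit noted in the abstract

def choose_link(src: int, dst: int) -> str:
    """Pick the interconnect for a point-to-point transfer between node ranks.

    PEACH2 (TCA) offers very low latency but only spans one sub-cluster;
    InfiniBand covers the whole machine with higher bandwidth and scalability.
    Assumes ranks are numbered consecutively within each sub-cluster.
    """
    if src // SUB_CLUSTER_SIZE == dst // SUB_CLUSTER_SIZE:
        return "PEACH2"      # local: both nodes in the same 16-node sub-cluster
    return "InfiniBand"      # global: transfer crosses sub-cluster boundaries

# Example: node 3 -> node 10 stays local; node 3 -> node 20 goes global.
print(choose_link(3, 10))
print(choose_link(3, 20))
```

In the paper's setting this decision is hidden inside the XACC language runtime, so application code (e.g. the Himeno stencil) is written once and the runtime routes each neighbor exchange or collective over the appropriate network.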