Hybrid Communication with TCA and InfiniBand on a Parallel Programming Language XcalableACC for GPU Clusters

Tetsuya Odajima, T. Boku, T. Hanawa, H. Murai, M. Nakao, Akihiro Tabuchi, M. Sato
Published in: 2015 IEEE International Conference on Cluster Computing, 2015-09-08. DOI: 10.1109/CLUSTER.2015.112. Citations: 2.

Abstract

For the execution of parallel HPC applications on GPU-ready clusters, high inter-node communication latency between GPUs is a serious obstacle to strong scalability. To reduce this latency, we proposed the Tightly Coupled Accelerator (TCA) architecture and developed the PEACH2 board as a proof-of-concept interconnect for TCA. Although PEACH2 provides very low communication latency, its PCIe-based implementation imposes hardware limitations; in particular, the practical number of nodes in a system is currently 16, a unit we call a sub-cluster. Larger node counts must be connected by a conventional interconnect such as InfiniBand, so the overall network is configured as a hybrid: a global conventional network combined with local high-speed PEACH2 networks. For ease of programming, it is desirable to operate such a complicated communication system at the library or language level, which hides the network details from the user. In this paper, we develop a hybrid interconnection network combining PEACH2 and InfiniBand, and implement it in a high-level PGAS language for accelerated clusters named XcalableACC (XACC). A preliminary performance evaluation confirms that, on the Himeno stencil-computation benchmark, the hybrid network improves performance by up to 40% relative to MVAPICH2 with GDR on InfiniBand. Additionally, Allgather collective communication over the hybrid network improves performance by up to 50% on networks of 8 to 16 nodes. The combination of local communication, supported by the low latency of PEACH2, with global communication, supported by the high bandwidth and scalability of InfiniBand, improves overall performance.
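The core idea of the hybrid network — low-latency PEACH2 links inside a 16-node sub-cluster, InfiniBand between sub-clusters — can be illustrated with a toy link-selection rule. This is a minimal sketch based only on the abstract; the function name and rank layout are assumptions, not the paper's actual XACC runtime API.

```python
SUB_CLUSTER_SIZE = 16  # practical PEACH2 sub-cluster limit noted in the abstract

def choose_link(src: int, dst: int) -> str:
    """Pick the interconnect for a point-to-point transfer between node ranks.

    PEACH2 (TCA) offers very low latency but only spans one sub-cluster;
    InfiniBand covers the whole machine with higher bandwidth and scalability.
    Assumes ranks are numbered consecutively within each sub-cluster.
    """
    if src // SUB_CLUSTER_SIZE == dst // SUB_CLUSTER_SIZE:
        return "PEACH2"      # local: both nodes in the same 16-node sub-cluster
    return "InfiniBand"      # global: transfer crosses sub-cluster boundaries

# Example: node 3 -> node 10 stays local; node 3 -> node 20 goes global.
print(choose_link(3, 10))
print(choose_link(3, 20))
```

In the paper's setting this decision is hidden inside the XACC language runtime, so application code (e.g. the Himeno stencil) is written once and the runtime routes each neighbor exchange or collective over the appropriate network.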