T. Hanawa, H. Fujii, N. Fujita, Tetsuya Odajima, Kazuya Matsumoto, Yuetsu Kodama, T. Boku
Improving Strong-Scaling on GPU Cluster Based on Tightly Coupled Accelerators Architecture
2015 IEEE International Conference on Cluster Computing, published 2015-09-08
DOI: 10.1109/CLUSTER.2015.154 (https://doi.org/10.1109/CLUSTER.2015.154)
Abstract
The Tightly Coupled Accelerators (TCA) architecture, which we proposed in previous work, enables direct communication between accelerators across nodes. In this paper, we present HA-PACS/TCA, a proof-of-concept GPU cluster built on PEACH2, an interconnection router chip that we designed based on the TCA architecture. Using RDMA over PCIe Gen2 x8, our system achieved a latency of 2.0 μs for inter-node GPU-to-GPU communication, reducing the minimum latency to just 44% of that obtained with InfiniBand QDR and MPI using GPUDirect for RDMA. Results from the Himeno benchmark show that the TCA architecture improved performance scalability on a small-sized problem by up to 61%.
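As a rough reading aid, the two latency figures in the abstract can be combined to estimate the baseline they imply. The sketch below assumes that the 2.0 μs figure and the 44% ratio refer to the same minimum-latency measurement, which the abstract suggests but does not state outright. The function name is hypothetical, and the result is a back-of-the-envelope estimate, not a value reported by the authors.

```python
def implied_baseline_latency_us(measured_us: float, fraction_of_baseline: float) -> float:
    """Return the baseline latency implied by a measured latency and its ratio to that baseline."""
    if not 0.0 < fraction_of_baseline <= 1.0:
        raise ValueError("fraction_of_baseline must be in (0, 1]")
    return measured_us / fraction_of_baseline


# Both inputs are taken from the abstract; pairing them is an assumption.
tca_latency_us = 2.0  # TCA inter-node GPU-to-GPU latency
ratio = 0.44          # TCA latency as a fraction of the InfiniBand QDR + MPI latency

baseline_us = implied_baseline_latency_us(tca_latency_us, ratio)
print(f"Implied InfiniBand QDR + MPI latency: {baseline_us:.2f} us")
```

Under that assumption the implied baseline is roughly 4.5 μs. The full paper remains the source for the measured values.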