Improving Strong-Scaling on GPU Cluster Based on Tightly Coupled Accelerators Architecture
T. Hanawa, H. Fujii, N. Fujita, Tetsuya Odajima, Kazuya Matsumoto, Yuetsu Kodama, T. Boku
2015 IEEE International Conference on Cluster Computing, 2015-09-08. DOI: 10.1109/CLUSTER.2015.154
The Tightly Coupled Accelerators (TCA) architecture that we proposed in previous work enables direct communication between accelerators across nodes. In this paper, we present a proof-of-concept GPU cluster called HA-PACS/TCA, built with the PEACH2 chip that we designed as an interconnection router chip based on the TCA architecture. Our system achieved 2.0 μsec latency for inter-node GPU-to-GPU communication over PCIe Gen2 x8 using RDMA, reducing the minimum latency to 44% of that of InfiniBand QDR with MPI using GPUDirect RDMA. Results of the Himeno benchmark demonstrate that the TCA architecture improves strong-scaling performance on small problem sizes by up to 61%.
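For context, the sketch below shows how an inter-node GPU-to-GPU latency figure of this kind is commonly measured with a small-message ping-pong between two ranks. It is only an illustration built on CUDA-aware MPI, not the PEACH2/TCA API described in the paper; the message size, iteration count, and the use of MPI_Send/MPI_Recv on device buffers are assumptions made for the sketch.

/*
 * Minimal GPU-to-GPU ping-pong latency sketch (CUDA-aware MPI).
 * Illustrative only: this is NOT the PEACH2/TCA communication API.
 * Message size and iteration count are arbitrary choices for the sketch.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int msg_bytes = 8;       /* small message: latency-bound regime */
    const int iters     = 1000;

    char *dbuf;                    /* device buffer handed directly to MPI */
    cudaMalloc((void **)&dbuf, msg_bytes);
    cudaMemset(dbuf, 0, msg_bytes);

    int peer = 1 - rank;           /* ranks 0 and 1 ping-pong; others idle */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters && rank < 2; i++) {
        if (rank == 0) {
            MPI_Send(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f usec\n",
               (t1 - t0) * 1e6 / (2.0 * iters));   /* half of round-trip */

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}

Compiled against a CUDA-aware MPI and run across two nodes, rank 0 reports the one-way small-message latency, which is the kind of figure the 2.0 μsec result and the 44% comparison above refer to.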