Improving Strong-Scaling on GPU Cluster Based on Tightly Coupled Accelerators Architecture
T. Hanawa, H. Fujii, N. Fujita, Tetsuya Odajima, Kazuya Matsumoto, Yuetsu Kodama, T. Boku
2015 IEEE International Conference on Cluster Computing, 2015-09-08. DOI: 10.1109/CLUSTER.2015.154
The Tightly Coupled Accelerators (TCA) architecture that we proposed in previous work enables direct communication between accelerators across nodes. In this paper, we present a proof-of-concept GPU cluster called HA-PACS/TCA, built with the PEACH2 chip that we designed as an interconnection router chip based on the TCA architecture. Our system achieved 2.0 μsec latency for inter-node GPU-to-GPU communication over PCIe Gen2 x8 using RDMA, reducing the minimum latency to 44% of that of InfiniBand QDR with MPI using GPUDirect RDMA. Results of the Himeno benchmark demonstrate that the TCA architecture improves strong-scaling performance on small problem sizes by up to 61%.
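For context, the sketch below shows how an inter-node GPU-to-GPU latency figure of this kind is commonly measured with a small-message ping-pong between two ranks. It is only an illustration built on CUDA-aware MPI, not the PEACH2/TCA API described in the paper; the message size, iteration count, and the use of MPI_Send/MPI_Recv on device buffers are assumptions made for the sketch.

/*
 * Minimal GPU-to-GPU ping-pong latency sketch (CUDA-aware MPI).
 * Illustrative only: this is NOT the PEACH2/TCA communication API.
 * Message size and iteration count are arbitrary choices for the sketch.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int msg_bytes = 8;       /* small message: latency-bound regime */
    const int iters     = 1000;

    char *dbuf;                    /* device buffer handed directly to MPI */
    cudaMalloc((void **)&dbuf, msg_bytes);
    cudaMemset(dbuf, 0, msg_bytes);

    int peer = 1 - rank;           /* ranks 0 and 1 ping-pong; others idle */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters && rank < 2; i++) {
        if (rank == 0) {
            MPI_Send(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(dbuf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f usec\n",
               (t1 - t0) * 1e6 / (2.0 * iters));   /* half of round-trip */

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}

Compiled against a CUDA-aware MPI and run across two nodes, rank 0 reports the one-way small-message latency, which is the kind of figure the 2.0 μsec result and the 44% comparison above refer to.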