Akshay Venkatesh, H. Subramoni, Khaled Hamidouche, D. Panda
2014 21st International Conference on High Performance Computing (HiPC), published 2014-12-01. DOI: 10.1109/HiPC.2014.7116875
A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters
Several streaming applications in high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power of modern GPGPUs. This raw compute power, coupled with the high network throughput of high performance interconnects such as InfiniBand (IB), is allowing streaming applications to scale rapidly. A frequently used operation that contributes to the execution of multi-node streaming applications is the broadcast operation, where data from a single source, typically a live data site, is transmitted to multiple sinks. Although high performance networks like IB offer novel features such as hardware-based multicast to speed up the broadcast operation, their benefits have been limited to host-based applications because IB Host Channel Adapters (HCAs) have been unable to directly access GPGPU memory. This poses a significant performance bottleneck for high performance streaming applications that rely heavily on broadcast operations from GPU memory. The recently introduced GPUDirect RDMA feature alleviates this bottleneck by enabling IB HCAs to perform data transfers directly to and from GPU memory (bypassing host memory). Thus, it presents an attractive option for designing high performance broadcast operations for GPGPU-based high performance streaming applications. In this work, we propose a novel method for fully utilizing GPUDirect RDMA and hardware multicast features in tandem to design a high performance broadcast operation for streaming applications. Experiments conducted with the proposed design show up to a 60% decrease in latency and a 3X-4X improvement in a throughput benchmark compared to the naive scheme on 64 GPU nodes.
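The contrast drawn in the abstract can be illustrated with a rough operation count. The sketch below is not the paper's design, nor its naive baseline. It is a hypothetical toy model in Python that assumes one source, `num_sinks` receivers, a message small enough to fit in a single multicast datagram, and a host-staged transfer that costs one GPU-to-host copy at the source plus one host-to-GPU copy at each sink. All names (`BroadcastCost`, `host_staged_unicast`, `host_staged_multicast`, `gdr_multicast`) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BroadcastCost:
    """Operation counts for one broadcast of a single message (toy model)."""
    source_sends: int    # send operations the source posts to the network
    staging_copies: int  # GPU<->host copies summed over all nodes


def _check(num_sinks: int) -> None:
    if num_sinks < 1:
        raise ValueError("num_sinks must be at least 1")


def host_staged_unicast(num_sinks: int) -> BroadcastCost:
    """Assumed scheme: stage through host memory, one send per sink."""
    _check(num_sinks)
    # One GPU-to-host copy at the source, one host-to-GPU copy per sink.
    return BroadcastCost(source_sends=num_sinks, staging_copies=1 + num_sinks)


def host_staged_multicast(num_sinks: int) -> BroadcastCost:
    """Assumed scheme: hardware multicast, but still staged through host memory."""
    _check(num_sinks)
    # The network replicates the message, so the source sends once.
    return BroadcastCost(source_sends=1, staging_copies=1 + num_sinks)


def gdr_multicast(num_sinks: int) -> BroadcastCost:
    """Assumed scheme: multicast with the HCA reading/writing GPU memory directly."""
    _check(num_sinks)
    # No staging copies because host memory is bypassed on every node.
    return BroadcastCost(source_sends=1, staging_copies=0)


if __name__ == "__main__":
    sinks = 63  # one source among 64 nodes (an assumption about the setup)
    for scheme in (host_staged_unicast, host_staged_multicast, gdr_multicast):
        print(scheme.__name__, scheme(sinks))
```

The model counts operations only. It ignores message chunking, reliability handling, and copy or wire time, so it says nothing about the latency and throughput figures reported above.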