Communication protocol optimization for enhanced GPU performance

IF 1.3 | Zone 4, Computer Science | Q1 Computer Science

IBM Journal of Research and Development Pub Date : 2020-01-16 DOI:10.1147/JRD.2020.2967311

S. S. Sharkawi;G. A. Chochia

IBM Journal of Research and Development, vol. 64, no. 3/4, pp. 9:1-9:9. Journal Article. https://ieeexplore.ieee.org/document/8961130/

Citations: 5

Abstract

The U.S. Department of Energy CORAL program systems SUMMIT and SIERRA are based on hybrid servers comprising IBM POWER9 CPUs and NVIDIA V100 graphics processing units (GPUs) connected by two extended data rate (EDR) links to a high-speed InfiniBand network. A major challenge for the communication software stack is to optimize performance for all combinations of data origin and destination: host or GPU memory, same or different server. Alternate paths exist for routing data from GPU memory. When origin and destination are on different servers, data can be sent either via host memory or, bypassing host memory, with the GPU direct feature. When origin and destination are on the same server, host memory can be bypassed with peer-to-peer interprocess communication (IPC). For large messages, pipelining makes the host memory data path competitive with GPU direct. In this article, we explain the techniques used in the Spectrum MPI Parallel Active Message Interface layer to cache memory types and attributes in order to reduce the overhead associated with calling the CUDA application programming interface (API); in addition, we detail the different protocols used for different memory types: device memory, managed memory, and host memory. To illustrate, the caching technique achieved a device-to-device latency improvement of 26% for intranode transfers and 19% for internode transfers.
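The caching idea in the abstract can be illustrated with a small sketch. It is not the Spectrum MPI implementation, whose details the abstract does not give. It assumes that the costly step is a per-pointer attribute query (in CUDA this might be a call such as `cudaPointerGetAttributes`) and that results can be reused for any address inside the same allocation. `MemType`, `Region`, `AttributeCache`, and `FakeDriver` are hypothetical names introduced only for this illustration; `FakeDriver` stands in for the real driver so that the example runs without a GPU.

```python
from bisect import bisect_right
from dataclasses import dataclass
from enum import Enum


class MemType(Enum):
    HOST = "host"
    DEVICE = "device"
    MANAGED = "managed"


@dataclass(frozen=True)
class Region:
    """Attributes of one allocation (hypothetical layout)."""
    base: int
    size: int
    mem_type: MemType
    device_id: int  # -1 for host memory


class AttributeCache:
    """Caches allocation attributes, keyed by address range."""

    def __init__(self, query):
        self._query = query   # expensive lookup (stand-in for a driver call)
        self._bases = []      # sorted base addresses
        self._regions = []    # regions, kept parallel to _bases
        self.misses = 0

    def lookup(self, addr):
        # Find the cached region with the largest base <= addr.
        i = bisect_right(self._bases, addr) - 1
        if i >= 0:
            region = self._regions[i]
            if region.base <= addr < region.base + region.size:
                return region  # hit: no driver call needed
        # Miss: pay for one query, then remember the whole allocation.
        self.misses += 1
        region = self._query(addr)
        j = bisect_right(self._bases, region.base)
        self._bases.insert(j, region.base)
        self._regions.insert(j, region)
        return region

    def invalidate(self, base):
        # Assumed to be called when the allocation at `base` is freed,
        # so that a reused address never returns stale attributes.
        i = bisect_right(self._bases, base) - 1
        if i >= 0 and self._bases[i] == base:
            del self._bases[i]
            del self._regions[i]


class FakeDriver:
    """Test double that counts how often the expensive query runs."""

    def __init__(self, regions):
        self._regions = regions
        self.calls = 0

    def query(self, addr):
        self.calls += 1
        for region in self._regions:
            if region.base <= addr < region.base + region.size:
                return region
        raise KeyError(hex(addr))


driver = FakeDriver([
    Region(0x1000, 0x1000, MemType.HOST, -1),
    Region(0x7000, 0x2000, MemType.DEVICE, 0),
])
cache = AttributeCache(driver.query)
for offset in (0x0, 0x10, 0x1FFF):
    cache.lookup(0x7000 + offset)  # only the first lookup reaches the driver
```

In this sketch, repeated sends from the same device buffer cost one query rather than one per message. The price is that the cache must be invalidated whenever memory is freed, which is why `invalidate` is part of the interface.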

Source journal

IBM Journal of Research and Development (Engineering & Technology, Computer Science: Hardware)

Self-citation rate

0.00%

Articles published

Review time

6-12 weeks

About the journal: The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals. Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.