Hybrid CPU/GPU tasks optimized for concurrency in OpenMP

IF 1.3 | Zone 4, Computer Science | Q1 Computer Science

IBM Journal of Research and Development Pub Date : 2019-12-17 DOI:10.1147/JRD.2019.2960245

A. E. Eichenberger;G.-T. Bercea;A. Bataev;L. Grinberg;J. K. O'Brien

IBM Journal of Research and Development, vol. 64, no. 3/4, pp. 13:1-13:14. Journal Article. https://ieeexplore.ieee.org/document/8935508/

Citations: 1

Abstract

Sierra and Summit supercomputers exhibit a significant amount of intranode parallelism between the host POWER9 CPUs and their attached GPU devices. In this article, we show that exploiting device-level parallelism is key to achieving high performance by reducing overheads typically associated with CPU and GPU task execution. Moreover, manually exploiting this type of parallelism in large-scale applications is nontrivial and error-prone. We hide the complexity of exploiting this hybrid intranode parallelism using the OpenMP programming model abstraction. The implementation leverages the semantics of OpenMP tasks to express asynchronous task computations and their associated dependences. Launching tasks on the CPU threads requires a careful design of work-stealing algorithms to provide efficient load balancing among CPU threads. We propose a novel algorithm that removes locks from all task queueing operations that are on the critical path. Tasks assigned to GPU devices require additional steps such as copying input data to GPU devices, launching the computation kernels, and copying data back to the host CPU memory. We perform key optimizations to reduce the cost of these additional steps by tightly integrating data transfers and GPU computations into streams of asynchronous GPU operations. We also map high-level dependences between GPU tasks onto the same asynchronous GPU streams, which avoids unnecessary synchronization. Results validate our approach.
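The programming model the abstract refers to can be illustrated with standard OpenMP directives. The sketch below is not taken from the article and does not reproduce the authors' runtime; it only shows how a CPU task, an asynchronous GPU task (`target ... nowait`), and a second CPU task might be chained through `depend` clauses, leaving the runtime free to order data transfers and kernel launches. The function name `hybrid_pipeline` and the arithmetic are hypothetical. If the compiler has no offloading support, the target region falls back to the host; if OpenMP is disabled, the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <stddef.h>

/* Hypothetical example: a three-stage CPU -> GPU -> CPU pipeline.
 * Returns the sum of y[i] = 2*x[i] + 1 for x[i] = i, i in [0, n). */
double hybrid_pipeline(double *x, double *y, size_t n)
{
    double sum = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Stage 1 (CPU task): produce the input array. */
        #pragma omp task depend(out: x[0:n]) shared(x)
        for (size_t i = 0; i < n; i++)
            x[i] = (double)i;

        /* Stage 2 (GPU task): copy x to the device, run the kernel,
         * copy y back. nowait makes this a deferred target task, so the
         * encountering thread does not block on the device. */
        #pragma omp target teams distribute parallel for nowait \
                map(to: x[0:n]) map(from: y[0:n]) \
                depend(in: x[0:n]) depend(out: y[0:n])
        for (size_t i = 0; i < n; i++)
            y[i] = 2.0 * x[i] + 1.0;

        /* Stage 3 (CPU task): consume the GPU result once y is ready. */
        #pragma omp task depend(in: y[0:n]) shared(sum, y)
        for (size_t i = 0; i < n; i++)
            sum += y[i];

        /* Wait for all three tasks before leaving the region. */
        #pragma omp taskwait
    }
    return sum;
}
```

Calling `hybrid_pipeline(x, y, n)` with two caller-allocated arrays of length `n` returns `n * n` for this particular arithmetic, which gives a quick way to check that the dependences were honored.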


Source journal

IBM Journal of Research and Development (Engineering & Technology - Computer Science: Hardware)

Self-citation rate

0.00%

Number of articles published

Review time

6-12 weeks

Journal description: The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals. Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.