Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications

Impact Factor 3.5 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Hardware & Architecture)
Marc Gonzalez Tallada, E. Morancho
{"title":"Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications","authors":"Marc Gonzalez Tallada, E. Morancho","doi":"10.1177/10943420231188079","DOIUrl":null,"url":null,"abstract":"Hybrid computer systems combine compute units (CUs) of different nature like CPUs, GPUs and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of the applications into balanced parallel tasks according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for OpenMP hybrid GPU-CPU applications, when mixed with GPU-oriented programming models (e.g. CUDA/HIP). The paper describes the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different nature (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task schedulings. Then, we improve the set of schedulers with two additional variants: a memorizing-dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and 2 × GPU AMD Radeon Instinct MI50 with 32 GB, hybrid executions present speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"626 - 646"},"PeriodicalIF":3.5000,"publicationDate":"2023-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of High Performance Computing Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1177/10943420231188079","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

Abstract

Hybrid computer systems combine compute units (CUs) of different nature like CPUs, GPUs and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of the applications into balanced parallel tasks according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for OpenMP hybrid GPU-CPU applications, when mixed with GPU-oriented programming models (e.g. CUDA/HIP). The paper describes the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different nature (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task schedulings. Then, we improve the set of schedulers with two additional variants: a memorizing-dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and 2 × GPU AMD Radeon Instinct MI50 with 32 GB, hybrid executions present speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.
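As a rough illustration of the hybrid decomposition described in the abstract, the sketch below dispatches coarse-grain "zone" tasks either to a GPU compute unit (through a HIP kernel) or to CPU threads (through an OpenMP parallel loop). This is a minimal, hypothetical example (compiled with, e.g., hipcc -fopenmp): the zone count, the zone_scale_* routines, and the even/odd static placement are assumptions made for illustration only, and do not reproduce the paper's runtime, its Computing Unit and Data Placement abstractions, or the NPB-MZ code.

```cpp
// Illustrative sketch only: hybrid OpenMP + HIP dispatch of coarse-grain zone
// tasks. All names and sizes are hypothetical, not taken from the paper.
#include <hip/hip_runtime.h>
#include <omp.h>
#include <vector>
#include <cstdio>

// Fine-grain GPU work for one zone: scale every element of the zone.
__global__ void zone_scale_gpu(double* data, int n, double factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Equivalent fine-grain CPU work, parallelized with OpenMP threads.
static void zone_scale_cpu(double* data, int n, double factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) data[i] *= factor;
}

int main() {
    const int n_zones = 8;          // coarse-grain tasks (hypothetical count)
    const int zone_size = 1 << 20;  // elements per zone (hypothetical size)
    const double factor = 2.0;

    std::vector<std::vector<double>> zones(n_zones,
                                           std::vector<double>(zone_size, 1.0));

    // Simplest static placement: even zones run on the GPU compute unit,
    // odd zones stay on CPU threads. A real runtime would decide placement
    // per zone according to its scheduling policy.
    for (int z = 0; z < n_zones; ++z) {
        if (z % 2 == 0) {
            double* d_data = nullptr;
            hipMalloc(&d_data, zone_size * sizeof(double));
            hipMemcpy(d_data, zones[z].data(), zone_size * sizeof(double),
                      hipMemcpyHostToDevice);
            dim3 block(256), grid((zone_size + 255) / 256);
            hipLaunchKernelGGL(zone_scale_gpu, grid, block, 0, 0,
                               d_data, zone_size, factor);
            hipMemcpy(zones[z].data(), d_data, zone_size * sizeof(double),
                      hipMemcpyDeviceToHost);
            hipFree(d_data);
        } else {
            zone_scale_cpu(zones[z].data(), zone_size, factor);
        }
    }

    printf("zone 0 sample: %f, zone 1 sample: %f\n", zones[0][0], zones[1][0]);
    return 0;
}
```

In the paper's terms, the hard-coded even/odd split stands in for the simplest static task scheduling; the dynamic and memorizing-dynamic variants mentioned in the abstract would instead decide each zone's placement at runtime, and the profile-based static variant from previously measured per-zone costs.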
Source journal

International Journal of High Performance Computing Applications (Engineering & Technology - Computer Science: Interdisciplinary Applications)
CiteScore: 6.10
Self-citation rate: 6.50%
Articles published: 32
Review time: >12 weeks