POSTER: Pagoda: A runtime system to maximize GPU utilization in data parallel tasks with limited parallelism

T. Yeh, Amit Sabne, Putt Sakdhnagool, R. Eigenmann, Timothy G. Rogers
DOI: 10.1145/2967938.2974055
Published in: 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT), September 11, 2016
Citation count: 0

Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization, and their performance benefits vanish, if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. Recognizing the issue, CUDA now allows 32 simultaneous tasks on GPUs; however, that still leaves significant room for underutilization. This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 2.44x over PThreads running on a 20-core CPU, 1.43x over CUDA-HyperQ, and 1.33x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
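The core idea the abstract describes, a resident daemon that keeps dispatching narrow tasks to whichever warps are free, can be illustrated with a small host-side C++ analogy. This is a hypothetical sketch, not Pagoda's actual device-side implementation: the names (`WarpPool`, `spawn`, `step`) are invented here, and the real MasterKernel runs persistently on the GPU and claims warps in device code.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch of MasterKernel-style scheduling: a fixed pool of
// "warp slots" is kept busy by draining a queue of narrow tasks, instead
// of launching one underutilized full-GPU kernel per task.
struct WarpPool {
    explicit WarpPool(int nslots) : busy(nslots, false) {}

    std::vector<bool> busy;                        // one flag per warp slot
    std::queue<std::function<void()>> pending;     // tasks spawned from the CPU

    void spawn(std::function<void()> task) { pending.push(std::move(task)); }

    // One scheduling step: assign queued tasks to free slots, run them,
    // then release the slots. Returns the number of tasks executed.
    int step() {
        int launched = 0;
        for (std::size_t w = 0; w < busy.size() && !pending.empty(); ++w) {
            if (busy[w]) continue;
            busy[w] = true;        // claim the free warp slot
            pending.front()();     // execute the narrow task on that slot
            pending.pop();
            busy[w] = false;       // slot becomes free for the next task
            ++launched;
        }
        return launched;
    }
};
```

With a 4-slot pool and 10 queued tasks, the scheduler drains the queue in three steps (4 + 4 + 2), which mirrors why warp-level dispatch keeps utilization high when individual tasks are far too small to fill the device.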