POSTER: Pagoda: A runtime system to maximize GPU utilization in data parallel tasks with limited parallelism

T. Yeh, Amit Sabne, Putt Sakdhnagool, R. Eigenmann, Timothy G. Rogers
DOI: 10.1145/2967938.2974055
Published in: 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT), September 11, 2016
Citation count: 0

Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization, and their performance benefits vanish, if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. Recognizing the issue, CUDA now allows 32 simultaneous tasks on GPUs; however, that still leaves significant room for underutilization. This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 2.44x over PThreads running on a 20-core CPU, 1.43x over CUDA-HyperQ, and 1.33x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
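The core idea the abstract describes, a resident daemon that keeps dispatching narrow tasks to whichever warps are free, can be illustrated with a small host-side C++ analogy. This is a hypothetical sketch, not Pagoda's actual device-side implementation: the names (`WarpPool`, `spawn`, `step`) are invented here, and the real MasterKernel runs persistently on the GPU and claims warps in device code.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch of MasterKernel-style scheduling: a fixed pool of
// "warp slots" is kept busy by draining a queue of narrow tasks, instead
// of launching one underutilized full-GPU kernel per task.
struct WarpPool {
    explicit WarpPool(int nslots) : busy(nslots, false) {}

    std::vector<bool> busy;                        // one flag per warp slot
    std::queue<std::function<void()>> pending;     // tasks spawned from the CPU

    void spawn(std::function<void()> task) { pending.push(std::move(task)); }

    // One scheduling step: assign queued tasks to free slots, run them,
    // then release the slots. Returns the number of tasks executed.
    int step() {
        int launched = 0;
        for (std::size_t w = 0; w < busy.size() && !pending.empty(); ++w) {
            if (busy[w]) continue;
            busy[w] = true;        // claim the free warp slot
            pending.front()();     // execute the narrow task on that slot
            pending.pop();
            busy[w] = false;       // slot becomes free for the next task
            ++launched;
        }
        return launched;
    }
};
```

With a 4-slot pool and 10 queued tasks, the scheduler drains the queue in three steps (4 + 4 + 2), which mirrors why warp-level dispatch keeps utilization high when individual tasks are far too small to fill the device.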