Oversubscribed Command Queues in GPUs

Sooraj Puthoor, Xulong Tang, Joseph Gross, Bradford M. Beckmann
DOI: 10.1145/3180270.3180271
Published in: Proceedings of the 11th Workshop on General Purpose GPUs, February 24, 2018
Citations: 16

Abstract

As GPUs become larger and provide an increasing number of parallel execution units, a single kernel is no longer sufficient to utilize all available resources. As a result, GPU applications are beginning to use fine-grain asynchronous kernels, which are executed in parallel and expose more concurrency. Currently, the Heterogeneous System Architecture (HSA) and Compute Unified Device Architecture (CUDA) specifications support concurrent kernel launches with the help of multiple command queues (a.k.a. HSA queues and CUDA streams, respectively). At the same time, GPU hardware has reduced launch overheads, making fine-grain kernels more attractive. Although increasing the number of command queues is good for kernel concurrency, the GPU hardware can only monitor a fixed number of queues at any given time. Therefore, if the number of command queues exceeds the hardware's monitoring capability, the queues become oversubscribed and the hardware must service some of these queues sequentially. This mapping process periodically swaps between all allocated queues and limits the available concurrency to the ready kernels in the currently mapped queues. In this paper, we bring attention to the queue oversubscription challenge and demonstrate one solution, queue prioritization, which provides up to a 45x speedup for the NW benchmark over a baseline that swaps queues in a round-robin fashion.
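The oversubscription effect described above can be illustrated with a minimal sketch. This is a toy Python simulation, not the paper's actual hardware mechanism: it assumes an abstract machine that maps only `hw_slots` of the allocated queues per scheduling interval and executes one ready kernel from each mapped queue, and it compares a round-robin swap policy against a hypothetical priority policy that always maps the queues with the most pending work. All names (`simulate`, `work`, `hw_slots`) are illustrative.

```python
def simulate(work, hw_slots, policy):
    """Drain per-queue kernel counts. The 'hardware' can map only
    hw_slots queues per interval and runs one ready kernel from
    each mapped, non-empty queue. Returns the interval count."""
    work = list(work)
    n = len(work)
    cursor = 0       # round-robin position
    intervals = 0
    while any(work):
        if policy == "round-robin":
            # swap in the next hw_slots queues, regardless of pending work
            mapped = [(cursor + i) % n for i in range(hw_slots)]
            cursor = (cursor + hw_slots) % n
        else:  # "priority": always map the queues with the most pending work
            mapped = sorted(range(n), key=lambda q: -work[q])[:hw_slots]
        for q in mapped:
            if work[q] > 0:
                work[q] -= 1   # one kernel completes on a mapped queue
        intervals += 1
    return intervals

# One queue holding a long chain of 64 kernels plus 15 queues with one
# kernel each, on hardware that monitors only 4 of the 16 queues at a time.
work = [64] + [1] * 15
print(simulate(work, hw_slots=4, policy="round-robin"))  # 253 intervals
print(simulate(work, hw_slots=4, policy="priority"))     # 64 intervals
```

Under round-robin swapping, the busy queue is mapped only once every `n / hw_slots` intervals, so most hardware slots go to already-empty queues; the priority policy keeps the busy queue resident and finishes roughly `hw_slots` times sooner, which is the qualitative effect the paper's queue-prioritization result demonstrates at much larger scale.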