Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

Hongwen Dai, Zhen Lin, C. Li, Chen Zhao, Fei Wang, Nanning Zheng, Huiyang Zhou
{"title":"Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls","authors":"Hongwen Dai, Zhen Lin, C. Li, Chen Zhao, Fei Wang, Nanning Zheng, Huiyang Zhou","doi":"10.1109/HPCA.2018.00027","DOIUrl":null,"url":null,"abstract":"Following the advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources and it becomes difficult for a single GPU kernel to fully utilize the vast GPU resources. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets the leftover resources. However, it fails to optimize the resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although achieving better fairness, the resource underutilization within an SM is not addressed. Thus, intra-SM sharing has been proposed to issue thread blocks from different kernels to each SM. However, as shown in this study, the overall performance may be undermined in the intra-SM sharing schemes due to the severe interference among kernels. Specifically, as concurrent kernels share the memory subsystem, one kernel, even as computing-intensive, may starve from not being able to issue memory instructions in time. Besides, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, will impact other kernels, further hurting the overall performance. In this study, we investigate various approaches to overcome the aforementioned problems exposed in intra-SM sharing. We first highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. Then we propose two approaches to reduce memory pipeline stalls. The first is to balance memory accesses of concurrent kernels. The second is to limit the number of inflight memory instructions issued from individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"33","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2018.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 33

Abstract

With advances in technology scaling, graphics processing units (GPUs) incorporate increasing amounts of computing resources, and it becomes difficult for a single GPU kernel to fully utilize them. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets leftover resources; however, it fails to optimize resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although it achieves better fairness, it does not address resource underutilization within an SM. Thus, intra-SM sharing has been proposed to issue thread blocks from different kernels to each SM. However, as shown in this study, overall performance may be undermined in intra-SM sharing schemes due to severe interference among kernels. Specifically, as concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. Moreover, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, affect other kernels and further hurt overall performance. In this study, we investigate approaches to overcome these problems exposed in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reduce memory pipeline stalls: the first balances the memory accesses of concurrent kernels; the second limits the number of in-flight memory instructions issued by individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
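To make the two mitigation ideas concrete, the following is a minimal Python sketch of a per-cycle memory-issue arbitration policy. It is an illustration only, not the paper's hardware design: the kernel names, the in-flight cap value, and the least-issued-first balancing rule are assumptions introduced for this example.

```python
# Illustrative sketch (not the paper's implementation) of the two policies
# named in the abstract: (1) balancing memory-instruction issue across
# concurrent kernels and (2) capping in-flight memory instructions per kernel.
# Kernel names, the cap value, and the balancing rule are assumptions.

from collections import deque
from dataclasses import dataclass, field


@dataclass
class KernelState:
    name: str
    ready_mem_warps: deque = field(default_factory=deque)  # warps with a ready memory instruction
    inflight_mem: int = 0        # memory instructions issued but not yet completed
    issued_mem_total: int = 0    # running count used for balancing


MAX_INFLIGHT_PER_KERNEL = 8      # assumed per-kernel cap on in-flight memory instructions


def pick_kernel_for_mem_issue(kernels):
    """Choose which kernel may issue a memory instruction this cycle.

    Policy 1 (balance): prefer the kernel that has issued fewer memory
    instructions so far, so a memory-intensive kernel cannot monopolize
    memory-instruction issue and starve its co-runner.
    Policy 2 (throttle): skip any kernel whose in-flight memory instructions
    already reach the per-kernel cap, limiting the L1 D-cache thrashing and
    memory pipeline stalls it can cause.
    """
    eligible = [k for k in kernels
                if k.ready_mem_warps and k.inflight_mem < MAX_INFLIGHT_PER_KERNEL]
    if not eligible:
        return None
    return min(eligible, key=lambda k: k.issued_mem_total)


def issue_cycle(kernels):
    """One scheduling decision: issue at most one memory instruction.

    Note: completions are not modeled here, so inflight_mem only grows in
    this toy example; a real model would decrement it when requests return.
    """
    k = pick_kernel_for_mem_issue(kernels)
    if k is None:
        return None
    warp = k.ready_mem_warps.popleft()
    k.inflight_mem += 1
    k.issued_mem_total += 1
    return (k.name, warp)


if __name__ == "__main__":
    compute_kernel = KernelState("compute_bound", deque(range(4)))
    memory_kernel = KernelState("memory_bound", deque(range(32)))
    for cycle in range(12):
        decision = issue_cycle([compute_kernel, memory_kernel])
        print(f"cycle {cycle}: issued {decision}")
```

Running the sketch shows issue slots alternating between the two kernels while the compute-bound kernel still has ready memory instructions, rather than letting the memory-bound kernel consume every slot.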