Improving GPU Performance Through Resource Sharing

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
Pub Date: 2015-03-19
DOI: 10.1145/2907294.2907298

Vishwesh Jatala, Jayvant Anantpur, Amey Karkare

Citations: 8

Abstract

Graphics Processing Units (GPUs), which consist of Streaming Multiprocessors (SMs), achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The number of thread blocks, and hence the number of threads, that can be launched on an SM depends on the resource usage of the thread blocks (e.g., number of registers, amount of shared memory). Because threads are allocated to an SM at thread block granularity, some of the resources may not be used up completely and are therefore wasted. We propose an approach that shares the resources of an SM so that the wasted resources can be used to launch more thread blocks. We show the effectiveness of our approach for two resources: registers (register sharing) and scratchpad, i.e., shared memory (scratchpad sharing). We further propose optimizations to hide long execution latencies, thus reducing the number of stall cycles. We implemented our approach in the GPGPU-Sim simulator and experimentally validated it on 19 applications from 4 different benchmark suites: GPGPU-Sim, Rodinia, CUDA-SDK, and Parboil. We observed that applications that underutilize register resources show a maximum improvement of 24% and an average improvement of 11% with register sharing. Similarly, applications that underutilize scratchpad resources show a maximum improvement of 30% and an average improvement of 12.5% with scratchpad sharing. The remaining applications, which do not waste any resources, perform similarly to the baseline approach.
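The waste described in the abstract follows from simple integer arithmetic, and the sketch below illustrates it. It is not the authors' sharing mechanism: it only computes how many whole thread blocks fit on one SM and what is left over. The names `SMLimits`, `KernelUsage`, `resident_blocks`, and `unused_resources`, and all numeric limits in the example, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM capacities (hypothetical values are supplied by the caller)."""
    registers: int
    scratchpad_bytes: int
    max_threads: int
    max_blocks: int


@dataclass(frozen=True)
class KernelUsage:
    """Resource demand of one thread block of a kernel."""
    threads_per_block: int
    registers_per_thread: int
    scratchpad_bytes_per_block: int


def resident_blocks(sm: SMLimits, k: KernelUsage) -> int:
    """Whole thread blocks that fit on one SM: the minimum over all limits."""
    regs_per_block = k.threads_per_block * k.registers_per_thread
    limits = [sm.max_blocks, sm.max_threads // k.threads_per_block]
    if regs_per_block > 0:
        limits.append(sm.registers // regs_per_block)
    if k.scratchpad_bytes_per_block > 0:
        limits.append(sm.scratchpad_bytes // k.scratchpad_bytes_per_block)
    return min(limits)


def unused_resources(sm: SMLimits, k: KernelUsage) -> tuple[int, int]:
    """Registers and scratchpad bytes left idle after block-granular allocation."""
    blocks = resident_blocks(sm, k)
    regs_used = blocks * k.threads_per_block * k.registers_per_thread
    scratchpad_used = blocks * k.scratchpad_bytes_per_block
    return sm.registers - regs_used, sm.scratchpad_bytes - scratchpad_used


if __name__ == "__main__":
    # Assumed, illustrative limits; not taken from the paper.
    sm = SMLimits(registers=32768, scratchpad_bytes=16384,
                  max_threads=1536, max_blocks=8)
    kernel = KernelUsage(threads_per_block=256, registers_per_thread=36,
                         scratchpad_bytes_per_block=2048)
    print(resident_blocks(sm, kernel))
    print(unused_resources(sm, kernel))
```

With these assumed numbers, each block needs 256 × 36 = 9216 registers, so only 3 blocks fit in 32768 registers and 5120 registers sit idle: too few for a fourth block, but more than half of what one would need. Leftovers of this kind are what the abstract calls wasted resources.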
