Power Efficient Sharing-Aware GPU Data Management

Abdulaziz Tabbakh, M. Annavaram, Xuehai Qian
{"title":"Power Efficient Sharing-Aware GPU Data Management","authors":"Abdulaziz Tabbakh, M. Annavaram, Xuehai Qian","doi":"10.1109/IPDPS.2017.106","DOIUrl":null,"url":null,"abstract":"The power consumed by memory system in GPUs is a significant fraction of the total chip power. As thread level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share considerable amount of data. However, the default GPU scheduling policy spreads these CTAs to different streaming multiprocessor cores (SM) in a round-robin fashion. Since each SM has a private L1 cache, the shared data among CTAs are replicated across L1 caches of different SMs. Data replication reduces the effective L1 cache size which in turn increases the data movement and power consumption. The goal of this paper is to reduce data movement and increase effective cache space in GPUs. We propose a sharing-aware CTA scheduler that attempts to assign CTAs with data sharing to the same SM to reduce redundant storage of data in private L1 caches across SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in L1 cache, and shared blocks are given higher priority to stay longer in L2 cache. Essentially, this approach increases the lifetime of shared blocks and private blocks in different cache levels. The experimental results show that the proposed scheme reduces the off-chip traffic by 19\\% which translates to an average DRAM power reduction of 10% and performance improvement of 7%.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2017.106","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

The power consumed by the memory system in GPUs is a significant fraction of the total chip power. As thread-level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share a considerable amount of data. However, the default GPU scheduling policy spreads these CTAs to different streaming multiprocessors (SMs) in a round-robin fashion. Since each SM has a private L1 cache, the data shared among CTAs is replicated across the L1 caches of different SMs. This replication reduces the effective L1 cache size, which in turn increases data movement and power consumption. The goal of this paper is to reduce data movement and increase effective cache space in GPUs. We propose a sharing-aware CTA scheduler that attempts to assign CTAs that share data to the same SM, reducing redundant storage of data in private L1 caches across SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies blocks as private or shared. Private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache. Essentially, this approach increases the lifetime of private and shared blocks at different cache levels. The experimental results show that the proposed scheme reduces off-chip traffic by 19%, which translates to an average DRAM power reduction of 10% and a performance improvement of 7%.
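To illustrate the scheduling idea summarized above, the following is a minimal Python sketch (not the paper's implementation) contrasting the default round-robin CTA-to-SM assignment with a sharing-aware assignment that keeps a contiguous cluster of neighboring CTAs, which tend to share data, on the same SM. The names NUM_SMS, CTAS_PER_SM, and cluster_size are illustrative parameters and are not taken from the paper.

    NUM_SMS = 16          # number of streaming multiprocessors (illustrative)
    CTAS_PER_SM = 8       # max concurrent CTAs per SM (illustrative)

    def round_robin_schedule(num_ctas):
        # Default policy: consecutive CTAs are spread across different SMs.
        return {cta: cta % NUM_SMS for cta in range(num_ctas)}

    def sharing_aware_schedule(num_ctas, cluster_size=CTAS_PER_SM):
        # Sharing-aware policy: a contiguous cluster of neighboring CTAs is
        # mapped to the same SM so their shared data stays in one private L1.
        return {cta: (cta // cluster_size) % NUM_SMS for cta in range(num_ctas)}

    if __name__ == "__main__":
        rr = round_robin_schedule(32)
        sa = sharing_aware_schedule(32)
        # Neighboring CTAs 0..7 land on eight different SMs under round-robin,
        # but all on the same SM under the sharing-aware policy.
        print("round-robin  :", [rr[c] for c in range(8)])
        print("sharing-aware:", [sa[c] for c in range(8)])

The cache-management half of the scheme (keeping private blocks longer in L1 and shared blocks longer in L2) would act as a priority hint to the replacement policy on top of such an assignment; it is not sketched here.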