Adaptive GPU cache bypassing
Yingying Tian, Sooraj Puthoor, J. Greathouse, Bradford M. Beckmann, Daniel A. Jiménez
DOI: 10.1145/2716282.2716283
Proceedings of the 8th Workshop on General Purpose Processing using GPUs, February 7, 2015. Citations: 67.

Abstract: Modern graphics processing units (GPUs) include hardware-controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general-purpose GPU (GPGPU) computing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized cache, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches between many threads. Caching these streaming data structures needlessly burns power while evicting data that might otherwise fit in the cache. We propose a GPU cache management technique that improves the efficiency of small GPU caches while further reducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoiding needless insertions and evictions while also avoiding cache pollution, resulting in better performance. We show that, with a 16 KB L1 data cache, dynamic bypassing achieves performance similar to that of a double-sized L1 cache while reducing energy consumption by 25% and power by 18%. The technique is especially interesting for programs that do not use programmer-managed scratchpad memories.
We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.
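The abstract describes bypassing the cache for blocks that are unlikely to be referenced again before eviction. The sketch below is an illustrative software model of that idea, not the paper's actual hardware design: it uses a hypothetical table of saturating counters, indexed by the PC of the requesting load, to predict reuse. A PC whose fills are repeatedly evicted without being touched again loses confidence, and its subsequent misses bypass the cache instead of polluting a set.

```python
# Illustrative model of adaptive cache bypassing (an assumption-laden
# sketch, NOT the hardware mechanism from the paper): a direct-mapped
# cache plus a PC-indexed table of saturating reuse counters.

class BypassingCache:
    def __init__(self, num_sets=64, counter_bits=2):
        self.num_sets = num_sets
        self.max_count = (1 << counter_bits) - 1
        self.sets = {}          # set index -> (tag, fill_pc, reused_flag)
        self.predictor = {}     # load PC  -> saturating reuse counter

    def _counter(self, pc):
        # Unseen PCs start at maximum confidence (optimistic default).
        return self.predictor.get(pc, self.max_count)

    def access(self, pc, addr):
        """Model one load; returns 'hit', 'fill', or 'bypass'."""
        idx = addr % self.num_sets
        tag = addr // self.num_sets
        line = self.sets.get(idx)
        if line is not None and line[0] == tag:
            # Hit: the block was reused, so reward the PC that filled it.
            fill_pc = line[1]
            self.predictor[fill_pc] = min(self._counter(fill_pc) + 1,
                                          self.max_count)
            self.sets[idx] = (tag, fill_pc, True)
            return 'hit'
        # Miss: a zero counter predicts the block is dead on fill,
        # so skip insertion entirely rather than evict useful data.
        if self._counter(pc) == 0:
            return 'bypass'
        if line is not None and not line[2]:
            # Victim was never reused: penalize the PC that filled it.
            victim_pc = line[1]
            self.predictor[victim_pc] = max(self._counter(victim_pc) - 1, 0)
        self.sets[idx] = (tag, pc, False)
        return 'fill'
```

For example, a streaming load that walks a large array never re-touches its blocks, so each fill evicts an unreused victim, the counter for that PC drains to zero, and later misses from it return 'bypass'; a load with locality keeps hitting and retains full confidence. The paper's hardware predictor may differ substantially; this only conveys the adaptive-bypass concept.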