An energy efficient GPGPU memory hierarchy with tiny incoherent caches

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07). Pub Date: 2013-09-04. DOI: 10.1109/ISLPED.2013.6629259

Alamelu Sankaranarayanan, E. K. Ardestani, J. L. Briz, Jose Renau

Volume: 68 1, Pages: 9-14

Citations: 11

Abstract


An energy efficient GPGPU memory hierarchy with tiny incoherent caches

With progressive generations and the ever-increasing promise of computing power, GPGPUs have been quickly growing in size, and at the same time, energy consumption has become a major bottleneck for them. The first level data cache and the scratchpad memory are critical to the performance of a GPGPU, but they are extremely energy inefficient due to the large number of cores they need to serve. This problem could be mitigated by introducing a cache higher up in hierarchy that services fewer cores, but this introduces cache coherency issues that may become very significant, especially for a GPGPU with hundreds of thousands of in-flight threads. In this paper, we propose adding incoherent tinyCaches between each lane in an SM, and the first level data cache that is currently shared by all the lanes in an SM. In a normal multiprocessor, this would require hardware cache coherence between all the SM lanes capable of handling hundreds of thousands of threads. Our incoherent tinyCache architecture exploits certain unique features of the CUDA/OpenCL programming model to avoid complex coherence schemes. This tinyCache is able to filter out 62% of memory requests that would otherwise need to be serviced by the DL1G, and almost 81% of scratchpad memory requests, allowing us to achieve a 37% energy reduction in the on-chip memory hierarchy. We evaluate the tinyCache for different memory patterns and show that it is beneficial in most cases.
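The filtering idea in the abstract can be illustrated with a small simulation. The sketch below is not the authors' design: the abstract does not say how coherence is avoided, so the `invalidate` step at synchronization points is an assumption, and the class name `TinyCache`, the helper `filter_rate`, the 4-line capacity, the 32-byte line size, and the LRU policy are all hypothetical choices. It models loads only and does not attempt to reproduce the 62%, 81%, or 37% figures reported above.

```python
from collections import OrderedDict


class TinyCache:
    """Hypothetical per-lane filter cache: LRU, no coherence hardware."""

    def __init__(self, num_lines=4, line_bytes=32):
        self.num_lines = num_lines      # assumed capacity, not from the paper
        self.line_bytes = line_bytes    # assumed line size, not from the paper
        self.lines = OrderedDict()      # line address -> True, oldest first

    def access(self, addr):
        """Return True on a hit; a miss would be forwarded to the shared L1."""
        line = addr // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)        # refresh recency
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)      # evict least recently used
        self.lines[line] = True
        return False

    def invalidate(self):
        """Assumed policy: drop all lines at a synchronization point."""
        self.lines.clear()


def filter_rate(traces, barrier_every=None, **cache_args):
    """Fraction of loads served by the per-lane caches.

    traces: one list of byte addresses per lane.
    barrier_every: if set, invalidate each lane's cache every N accesses.
    """
    total = 0
    filtered = 0
    for trace in traces:
        cache = TinyCache(**cache_args)         # one private cache per lane
        for i, addr in enumerate(trace):
            if barrier_every and i > 0 and i % barrier_every == 0:
                cache.invalidate()
            total += 1
            filtered += cache.access(addr)      # True counts as 1
    return filtered / total if total else 0.0


# Four lanes, each reading 64 consecutive 4-byte words from its own region.
traces = [[lane * 1024 + 4 * i for i in range(64)] for lane in range(4)]
print(filter_rate(traces))                      # 0.875
print(filter_rate(traces, barrier_every=4))     # 0.75
```

With 32-byte lines, each lane misses once per eight words, so seven of every eight loads never reach the shared cache. Invalidating every four accesses halves the reuse window and lowers the rate to three in four, which shows how the cost of giving up coherence depends on how often the caches must be emptied.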

