Alamelu Sankaranarayanan, E. K. Ardestani, J. L. Briz, Jose Renau
{"title":"具有微小非相干缓存的高能效GPGPU内存层次结构","authors":"Alamelu Sankaranarayanan, E. K. Ardestani, J. L. Briz, Jose Renau","doi":"10.1109/ISLPED.2013.6629259","DOIUrl":null,"url":null,"abstract":"With progressive generations and the ever-increasing promise of computing power, GPGPUs have been quickly growing in size, and at the same time, energy consumption has become a major bottleneck for them. The first level data cache and the scratchpad memory are critical to the performance of a GPGPU, but they are extremely energy inefficient due to the large number of cores they need to serve. This problem could be mitigated by introducing a cache higher up in hierarchy that services fewer cores, but this introduces cache coherency issues that may become very significant, especially for a GPGPU with hundreds of thousands of in-flight threads. In this paper, we propose adding incoherent tinyCaches between each lane in an SM, and the first level data cache that is currently shared by all the lanes in an SM. In a normal multiprocessor, this would require hardware cache coherence between all the SM lanes capable of handling hundreds of thousands of threads. Our incoherent tinyCache architecture exploits certain unique features of the CUDA/OpenCL programming model to avoid complex coherence schemes. This tinyCache is able to filter out 62% of memory requests that would otherwise need to be serviced by the DL1G, and almost 81% of scratchpad memory requests, allowing us to achieve a 37% energy reduction in the on-chip memory hierarchy. We evaluate the tinyCache for different memory patterns and show that it is beneficial in most cases.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"An energy efficient GPGPU memory hierarchy with tiny incoherent caches\",\"authors\":\"Alamelu Sankaranarayanan, E. K. Ardestani, J. L. Briz, Jose Renau\",\"doi\":\"10.1109/ISLPED.2013.6629259\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With progressive generations and the ever-increasing promise of computing power, GPGPUs have been quickly growing in size, and at the same time, energy consumption has become a major bottleneck for them. The first level data cache and the scratchpad memory are critical to the performance of a GPGPU, but they are extremely energy inefficient due to the large number of cores they need to serve. This problem could be mitigated by introducing a cache higher up in hierarchy that services fewer cores, but this introduces cache coherency issues that may become very significant, especially for a GPGPU with hundreds of thousands of in-flight threads. In this paper, we propose adding incoherent tinyCaches between each lane in an SM, and the first level data cache that is currently shared by all the lanes in an SM. In a normal multiprocessor, this would require hardware cache coherence between all the SM lanes capable of handling hundreds of thousands of threads. Our incoherent tinyCache architecture exploits certain unique features of the CUDA/OpenCL programming model to avoid complex coherence schemes. This tinyCache is able to filter out 62% of memory requests that would otherwise need to be serviced by the DL1G, and almost 81% of scratchpad memory requests, allowing us to achieve a 37% energy reduction in the on-chip memory hierarchy. We evaluate the tinyCache for different memory patterns and show that it is beneficial in most cases.\",\"PeriodicalId\":20456,\"journal\":{\"name\":\"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISLPED.2013.6629259\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISLPED.2013.6629259","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An energy efficient GPGPU memory hierarchy with tiny incoherent caches
With progressive generations and the ever-increasing promise of computing power, GPGPUs have been quickly growing in size, and at the same time, energy consumption has become a major bottleneck for them. The first level data cache and the scratchpad memory are critical to the performance of a GPGPU, but they are extremely energy inefficient due to the large number of cores they need to serve. This problem could be mitigated by introducing a cache higher up in hierarchy that services fewer cores, but this introduces cache coherency issues that may become very significant, especially for a GPGPU with hundreds of thousands of in-flight threads. In this paper, we propose adding incoherent tinyCaches between each lane in an SM, and the first level data cache that is currently shared by all the lanes in an SM. In a normal multiprocessor, this would require hardware cache coherence between all the SM lanes capable of handling hundreds of thousands of threads. Our incoherent tinyCache architecture exploits certain unique features of the CUDA/OpenCL programming model to avoid complex coherence schemes. This tinyCache is able to filter out 62% of memory requests that would otherwise need to be serviced by the DL1G, and almost 81% of scratchpad memory requests, allowing us to achieve a 37% energy reduction in the on-chip memory hierarchy. We evaluate the tinyCache for different memory patterns and show that it is beneficial in most cases.