Energy-Efficient GPU L2 Cache Design Using Instruction-Level Data Locality Similarity

Jingweijia Tan, Kaige Yan, S. Song, Xin Fu
ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 27, no. 1, pp. 1–18. DOI: 10.1145/3408060. Published: 2020-08-18.
Citation count: 1

Abstract

This article presents a novel energy-efficient cache design for massively parallel, throughput-oriented architectures like GPUs. Unlike the L1 data cache on modern GPUs, the L2 cache shared by all of the streaming multiprocessors is not the primary performance bottleneck, but it does consume a large amount of chip energy. We observe that the L2 cache is significantly underutilized, spending 95.6% of its time storing useless data. If such “dead time” on L2 is identified and reduced, L2’s energy efficiency can be drastically improved. Fortunately, we discover that the SIMT programming model of GPUs provides a unique feature among threads: instruction-level data locality similarity, which can be used to accurately predict data re-reference counts at the L2 cache block level. We propose a simple design that leverages this locality similarity to build an energy-efficient GPU L2 cache, named LoSCache. Specifically, LoSCache uses the data locality information from a small group of cooperative thread arrays to dynamically predict the L2-level data re-reference counts of the remaining cooperative thread arrays. After that, specific L2 cache lines can be powered off if they are predicted to be “dead” after a certain number of accesses. Experimental results on a wide range of applications demonstrate that our proposed design can reduce L2 cache energy by an average of 64% with only 0.5% performance loss. In addition, LoSCache is cost-effective, independent of the scheduling policies, and compatible with state-of-the-art L1 cache designs for additional energy savings.
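The core mechanism described in the abstract — recording per-instruction re-reference counts from a small sample of cooperative thread arrays (CTAs), then predicting when L2 lines fetched by the same instructions in later CTAs become dead — can be illustrated with a toy simulation. This is a minimal sketch, not the authors' implementation: the class name `LoSCacheModel`, the flat PC-indexed counter table, and the single-block granularity are all simplifying assumptions for illustration.

```python
from collections import defaultdict

class LoSCacheModel:
    """Toy model of instruction-level re-reference-count prediction.

    Sampling phase: accesses from a small group of CTAs record how many
    times each load instruction (identified by its PC) references its L2
    block. Prediction phase: a block fetched by the same PC in a remaining
    CTA is predicted to receive the same number of references, and is
    marked "dead" (eligible for power-off) once that count is exhausted.
    """

    def __init__(self):
        self.sampled_counts = defaultdict(int)  # PC -> observed reference count
        self.live_lines = {}                    # block addr -> remaining predicted refs
        self.powered_off = set()                # blocks predicted dead

    def sample_access(self, pc, addr):
        # Sampling phase: count references per instruction (PC).
        self.sampled_counts[pc] += 1

    def access(self, pc, addr):
        # Prediction phase: on first touch, seed the line with the count
        # observed for this PC during sampling (default 1 if unseen).
        if addr not in self.live_lines:
            self.live_lines[addr] = self.sampled_counts.get(pc, 1)
        self.live_lines[addr] -= 1
        if self.live_lines[addr] <= 0:
            # Predicted dead: the line could be powered off to cut leakage.
            self.powered_off.add(addr)
            del self.live_lines[addr]
```

For example, if a sampled CTA's load at PC `0x40` touches its block three times, a later CTA's block filled by the same PC is predicted dead after its third access; in hardware, the power-off would be realized with per-line sleep transistors rather than a Python set.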