Energy-Efficient GPU L2 Cache Design Using Instruction-Level Data Locality Similarity

Jingweijia Tan, Kaige Yan, S. Song, Xin Fu
ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 27, no. 1, pp. 1–18. DOI: 10.1145/3408060. Published: 2020-08-18.
Citation count: 1

Abstract

This article presents a novel energy-efficient cache design for massively parallel, throughput-oriented architectures like GPUs. Unlike the L1 data cache on modern GPUs, the L2 cache shared by all of the streaming multiprocessors is not the primary performance bottleneck, but it does consume a large amount of chip energy. We observe that the L2 cache is significantly underutilized, spending 95.6% of its time storing useless data. If such “dead time” on L2 is identified and reduced, L2’s energy efficiency can be drastically improved. Fortunately, we discover that the SIMT programming model of GPUs provides a unique feature among threads: instruction-level data locality similarity, which can be used to accurately predict data re-reference counts at the L2 cache block level. We propose a simple design that leverages this locality similarity to build an energy-efficient GPU L2 cache, named LoSCache. Specifically, LoSCache uses the data locality information from a small group of cooperative thread arrays to dynamically predict the L2-level data re-reference counts of the remaining cooperative thread arrays. After that, specific L2 cache lines can be powered off if they are predicted to be “dead” after a certain number of accesses. Experimental results on a wide range of applications demonstrate that our proposed design can reduce L2 cache energy by an average of 64% with only 0.5% performance loss. In addition, LoSCache is cost-effective, independent of the scheduling policies, and compatible with state-of-the-art L1 cache designs for additional energy savings.
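The core mechanism described in the abstract — recording per-instruction re-reference counts from a small sample of cooperative thread arrays (CTAs), then predicting when L2 lines fetched by the same instructions in later CTAs become dead — can be illustrated with a toy simulation. This is a minimal sketch, not the authors' implementation: the class name `LoSCacheModel`, the flat PC-indexed counter table, and the single-block granularity are all simplifying assumptions for illustration.

```python
from collections import defaultdict

class LoSCacheModel:
    """Toy model of instruction-level re-reference-count prediction.

    Sampling phase: accesses from a small group of CTAs record how many
    times each load instruction (identified by its PC) references its L2
    block. Prediction phase: a block fetched by the same PC in a remaining
    CTA is predicted to receive the same number of references, and is
    marked "dead" (eligible for power-off) once that count is exhausted.
    """

    def __init__(self):
        self.sampled_counts = defaultdict(int)  # PC -> observed reference count
        self.live_lines = {}                    # block addr -> remaining predicted refs
        self.powered_off = set()                # blocks predicted dead

    def sample_access(self, pc, addr):
        # Sampling phase: count references per instruction (PC).
        self.sampled_counts[pc] += 1

    def access(self, pc, addr):
        # Prediction phase: on first touch, seed the line with the count
        # observed for this PC during sampling (default 1 if unseen).
        if addr not in self.live_lines:
            self.live_lines[addr] = self.sampled_counts.get(pc, 1)
        self.live_lines[addr] -= 1
        if self.live_lines[addr] <= 0:
            # Predicted dead: the line could be powered off to cut leakage.
            self.powered_off.add(addr)
            del self.live_lines[addr]
```

For example, if a sampled CTA's load at PC `0x40` touches its block three times, a later CTA's block filled by the same PC is predicted dead after its third access; in hardware, the power-off would be realized with per-line sleep transistors rather than a Python set.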