Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture
Ataberk Olgun, Fatma Bostanci, Geraldo Francisco de Oliveira Junior, Yahya Can Tugrul, Rahul Bera, Abdullah Giray Yaglikci, Hasan Hassan, Oguz Ergin, Onur Mutlu
ACM Transactions on Architecture and Code Optimization. Published 2024-06-14. DOI: https://doi.org/10.1145/3673653
Abstract
Modern computing systems access data in main memory at coarse granularity (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does not use all individually accessed small portions (e.g., words, each of which typically is 64 bits) of a cache block. In modern DRAM-based computing systems, two key coarse-grained access mechanisms lead to wasted energy: large and fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations.
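To make the waste concrete, the following is a minimal sketch that uses the example sizes from the text (512-bit cache blocks, 64-bit words). The function name `unused_fraction` and the sample access pattern are illustrative assumptions, not taken from the paper.

```python
BLOCK_BITS = 512  # example cache block size from the text
WORD_BITS = 64    # example word size from the text
WORDS_PER_BLOCK = BLOCK_BITS // WORD_BITS  # 8 words per cache block


def unused_fraction(used_words):
    """Fraction of a transferred cache block that is never used.

    used_words: iterable of word indices (0..7) touched while the block
    resides in the cache.
    """
    used = set(used_words)
    if any(not 0 <= w < WORDS_PER_BLOCK for w in used):
        raise ValueError("word index out of range")
    return (WORDS_PER_BLOCK - len(used)) / WORDS_PER_BLOCK


# Hypothetical access pattern: only words 0 and 3 are ever touched.
print(unused_fraction([0, 3]))  # 0.75
```

In this hypothetical pattern, six of the eight transferred words are never used, so three quarters of the transfer is spent on data the processor does not need.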
We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfer and DRAM row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block’s residency in cache and: (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words by carefully operating physically isolated portions of DRAM rows (i.e., mats). Activating a smaller set of cells on each access relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster.
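The two steps above can be modeled in software as a bitmask with one bit per word. This is a toy sketch under stated assumptions: eight 64-bit words per block, one sector (group of mats) per word, and a simple last-use predictor indexed by a hypothetical `key` (for example, the address of the load instruction). The class name, the predictor policy, and the cost model are illustrative and are not taken from the paper.

```python
WORDS_PER_BLOCK = 8  # assumes 512-bit blocks and 64-bit words, as in the text's example
FULL_MASK = (1 << WORDS_PER_BLOCK) - 1


class SectorPredictor:
    """Toy last-use predictor (hypothetical policy): remembers which words
    were used the last time a block fetched under the same key was evicted."""

    def __init__(self):
        self._history = {}

    def predict(self, key, demanded_word):
        # With no history for this key, fall back to fetching the whole block.
        mask = self._history.get(key, FULL_MASK)
        # The word that triggered the miss must always be fetched.
        return mask | (1 << demanded_word)

    def train(self, key, used_mask):
        # Called on eviction with the set of words that were actually touched.
        self._history[key] = used_mask & FULL_MASK


def access_cost(mask):
    """Words sent on the channel and sectors activated for a given mask,
    assuming one sector per word (an assumption of this sketch)."""
    n = bin(mask).count("1")
    return {"words_transferred": n, "sectors_activated": n}


predictor = SectorPredictor()
first = predictor.predict(key=0x400, demanded_word=2)   # no history: full block
predictor.train(key=0x400, used_mask=0b00000101)        # words 0 and 2 were used
second = predictor.predict(key=0x400, demanded_word=0)  # predicted: words 0 and 2
print(access_cost(first), access_cost(second))
```

In this model the first access fetches all eight words, while the second fetches only the two predicted words and activates only the two corresponding sectors. A word that the predictor misses would require an additional access, which is why prediction accuracy matters for such a scheme.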
We evaluate Sectored DRAM using 41 workloads from widely used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly memory-intensive workloads by up to 33% (20% on average) while improving their performance by up to 36% (17% on average). Sectored DRAM’s DRAM energy savings, combined with its system performance improvement, allow system-wide energy savings of up to 23%. Sectored DRAM’s DRAM chip area overhead is 1.7% of the area of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM bandwidth, and can be implemented with low hardware cost. Compared to Half-DRAM, a high-performance state-of-the-art fine-grained DRAM architecture, Sectored DRAM provides 89% of the performance benefits, consumes 12% less DRAM energy, and takes up 34% less DRAM chip area. We hope and believe that Sectored DRAM’s ideas and results will help to enable more efficient and high-performance memory systems. To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.
About the journal:
ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.