Row Tables: Design Choices to Exploit Bank Locality in Multiprogram Workloads

Paula Navarro, Vicent Selfa, J. Sahuquillo, M. E. Gómez, Crispín Gómez Requena
Published in: 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015-03-04
DOI: 10.1109/PDP.2015.100 (https://doi.org/10.1109/PDP.2015.100)
Citations: 2

Abstract

Main memory is a major performance bottleneck in current chip multiprocessors. Current DRAM banks latch the last accessed row in an internal buffer, the row buffer (RB), which allows fast subsequent accesses to that row. This throughput-oriented approach was originally designed for single-threaded processors and aims to exploit the spatial locality that individual applications exhibit. This paper proposes row tables, a pool of row buffers shared among threads, in which row buffers are dynamically allocated to threads according to each thread's needs. Two design approaches are devised that differ in the table location: BRT (Bank Row Table) places the table at the bank, as is traditionally done in existing modules, and CRT (Controller Row Table) places it at the memory controller side. CRT performs better than BRT in applications (or mixes) with high RB locality but performs worse in applications with poor RB locality, since the increase in transfer times is not later amortized. A variant of CRT, referred to as CRT 1/x, has been devised to reduce this performance penalty. Results for a 4-core system show that, on average, the BRT and CRT 1/x mechanisms reduce energy consumption by 23% and 7%-16% (depending on the value of x) and improve IPC by 10% and 9%-14%, respectively.
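
The motivation for sharing a pool of row buffers can be illustrated with a minimal Python sketch. It is not the paper's design: the abstract does not state the allocation or replacement policy, so the LRU policy, the single-bank model, and the names `count_row_hits` and `num_entries` are assumptions made only for illustration. The sketch counts row-buffer hits when two threads with perfect spatial locality interleave their accesses at the same bank.

```python
from collections import OrderedDict


def count_row_hits(accesses, num_entries):
    """Count row-buffer hits at one bank.

    accesses: sequence of (thread_id, row) pairs in arrival order.
    num_entries: 1 models a conventional single row buffer; larger values
    model a hypothetical shared row table with LRU replacement
    (an assumed policy, not taken from the paper).
    """
    table = OrderedDict()  # open rows, ordered least to most recently used
    hits = 0
    for _thread, row in accesses:
        if row in table:
            hits += 1
            table.move_to_end(row)  # mark as most recently used
        else:
            if len(table) >= num_entries:
                table.popitem(last=False)  # evict the least recently used row
            table[row] = None
    return hits


# Two threads, each repeatedly touching its own row, interleaved at one bank.
interleaved = [(0, 10), (1, 20)] * 4

single = count_row_hits(interleaved, num_entries=1)  # conventional row buffer
shared = count_row_hits(interleaved, num_entries=2)  # two-entry row table
print(single, shared)  # prints "0 6"
```

With a single row buffer the two threads evict each other's row on every access, so the locality of each thread is lost. With two entries, only the first access of each thread misses. Under this assumed policy, entries go implicitly to whichever thread used them most recently; the paper's dynamic allocation may differ.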