In Search of Lost Bandwidth: Extensive Reordering of DRAM Accesses on FPGA

Gabor Csordas, Mikhail Asiatici, P. Ienne
2019 International Conference on Field-Programmable Technology (ICFPT), published 2019-12-01. DOI: 10.1109/ICFPT47387.2019.00030 (https://doi.org/10.1109/ICFPT47387.2019.00030)
Citations: 4

Abstract

For efficient acceleration on FPGAs, external memory must match the throughput of the processing pipelines. However, the usable DRAM bandwidth drops significantly when the access pattern causes frequent row conflicts. Memory controllers reorder DRAM commands to minimize row conflicts, but general-purpose controllers must also minimize latency, which limits the depth of the internal queues over which reordering can occur. For latency-insensitive applications with irregular access patterns, nonblocking caches that support thousands of in-flight misses (miss-optimized memory systems) improve bandwidth utilization by reusing the same memory response to serve as many incoming requests as possible. However, they do not make the access pattern sent to memory any more regular, so row conflicts remain an issue. Sending out bursts instead of single memory requests makes the access pattern more sequential; however, realistic implementations achieve high throughput at the cost of fetching some unnecessary data in the bursts, and the resulting bandwidth waste cancels out part of the gains from regularization. In this paper, we present an alternative approach that extends DRAM row conflict minimization beyond what general-purpose DRAM controllers can achieve. We use the thousands of future memory requests that spontaneously accumulate inside the miss-optimized memory system to implement an efficient large-scale reordering mechanism. By reordering single requests instead of sending bursts, we regularize the memory access pattern in a way that increases bandwidth utilization without wasting any data. Our solution outperforms the baseline miss-optimized memory system by up to 81% and achieves better worst-case, average, and best-case performance than DynaBurst across 15 benchmarks and 30 architectures.
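The effect of reordering depth on row conflicts can be illustrated with a toy model. The following Python sketch is not the mechanism proposed in the paper: it assumes a single DRAM bank with an open-row policy, a hypothetical row size of 1024 addresses (`ROW_BITS`), and a simple "group by row inside a window of pending requests" policy. All names are illustrative assumptions. It only shows why a reordering window of thousands of requests can remove far more row conflicts than a shallow controller queue.

```python
import random

ROW_BITS = 10  # assumption: 1024 addresses per DRAM row (not taken from the paper)


def row_of(addr: int) -> int:
    """Map an address to its DRAM row in a toy single-bank model."""
    return addr >> ROW_BITS


def count_row_conflicts(addresses) -> int:
    """Count accesses that target a different row than the previous access."""
    conflicts = 0
    open_row = None
    for addr in addresses:
        row = row_of(addr)
        if open_row is not None and row != open_row:
            conflicts += 1
        open_row = row
    return conflicts


def reorder_in_windows(addresses, window: int):
    """Group requests by row inside each window of pending requests.

    The window size stands in for the depth of the reordering queue.
    sorted() is stable, so requests to the same row keep their order.
    """
    out = []
    for start in range(0, len(addresses), window):
        chunk = addresses[start:start + window]
        out.extend(sorted(chunk, key=row_of))
    return out


def demo(seed: int = 0, n: int = 4096, rows: int = 64):
    """Return row-conflict counts for a random trace at several window sizes."""
    rng = random.Random(seed)
    trace = [rng.randrange(rows << ROW_BITS) for _ in range(n)]
    return {
        w: count_row_conflicts(reorder_in_windows(trace, w))
        for w in (1, 16, 1024)
    }


if __name__ == "__main__":
    for window, conflicts in demo().items():
        print(f"window={window:5d}  row conflicts={conflicts}")
```

A window of 1 leaves the trace untouched, which corresponds to no reordering. In this model every request is still issued individually, so no unnecessary data is fetched; only the order changes. A real DRAM has multiple banks and timing constraints that this sketch ignores.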