Reducing FPGA Memory Footprint of Stencil Codes through Automatic Extraction of Memory Patterns

Robert Szafarczyk, S. Nabi, W. Vanderbauwhede
Published in: 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), 2022-08-01
DOI: 10.1109/FPL57034.2022.00033 (https://doi.org/10.1109/FPL57034.2022.00033)
Citations: 0

Abstract

Reducing FPGA Memory Footprint of Stencil Codes through Automatic Extraction of Memory Patterns
FPGAs are attractive for scientific high-performance computing due to their potential for high performance-per-Watt. Stencil codes in scientific applications are difficult to optimize on FPGAs, because of redundant, non-contiguous memory accesses to relatively low bandwidth DRAM. In this paper, we present an algorithm to aggressively reduce on-chip block RAM (BRAM) and off-chip DRAM utilisation of stencil codes running on FPGAs. The algorithm extracts memory accesses from computational pipelines and removes all redundant intermediate arrays, including those used for stencil buffering, by trading DRAM accesses for computation. The algorithm is based on rewrite-rules on a strict functional representation derived from Fortran code and generates provably correct, optimized code. Typical FPGA implementations store the stencil window in on-chip shift registers implemented in BRAMs; we use only DRAM and optimize the memory accesses instead. Our approach dramatically reduces BRAM usage so that the domain size is only limited by available DRAM. We report a drop of 78% and 18% in BRAM usage in 3-D and 2-D stencil codes compared to a manual implementation using shift registers while staying competitive in performance or even improving performance-per-Watt.
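The contrast the abstract draws between shift-register buffering and direct DRAM access can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the authors' algorithm, which operates through rewrite rules on a functional representation derived from Fortran. The 5-point stencil, the function names, and the read counters are all hypothetical and chosen for illustration; a Python list stands in for DRAM and a bounded deque stands in for the on-chip shift register.

```python
from collections import deque


def stencil_shift_register(grid, height, width):
    """5-point stencil fed by a sliding window of 2*width + 1 elements.

    The window models the on-chip shift register: every element is read
    from `grid` (standing in for DRAM) exactly once.
    """
    window = deque(maxlen=2 * width + 1)
    out = {}
    reads = 0
    for idx in range(height * width):
        window.append(grid[idx])  # one streamed read per element
        reads += 1
        if len(window) < window.maxlen:
            continue  # window not yet full
        centre = idx - width  # flat index of the element at window[width]
        row, col = divmod(centre, width)
        if 0 < row < height - 1 and 0 < col < width - 1:
            out[centre] = (window[0] + window[width - 1] + window[width]
                           + window[width + 1] + window[2 * width])
    return out, reads, window.maxlen


def stencil_direct(grid, height, width):
    """Same stencil with no buffer: each neighbour is read from `grid`."""
    out = {}
    reads = 0
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            c = row * width + col
            out[c] = (grid[c - width] + grid[c - 1] + grid[c]
                      + grid[c + 1] + grid[c + width])
            reads += 5  # five memory reads per output point
    return out, reads, 0


height, width = 4, 5
grid = list(range(height * width))
buffered, buffered_reads, buffered_words = stencil_shift_register(grid, height, width)
direct, direct_reads, direct_words = stencil_direct(grid, height, width)
assert buffered == direct  # both variants compute the same interior points
```

In this toy model the buffered variant holds 2*width + 1 words on chip, so its storage grows with the domain width, while the direct variant holds none and issues more reads instead. That is the kind of trade-off the abstract describes, where the domain size ends up limited only by available DRAM.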