Reducing FPGA Memory Footprint of Stencil Codes through Automatic Extraction of Memory Patterns

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). Pub Date: 2022-08-01. DOI: 10.1109/FPL57034.2022.00033

Robert Szafarczyk, S. Nabi, W. Vanderbauwhede



Abstract

FPGAs are attractive for scientific high-performance computing because of their potential for high performance-per-Watt. Stencil codes in scientific applications are difficult to optimize on FPGAs because they make redundant, non-contiguous memory accesses to relatively low-bandwidth DRAM. In this paper, we present an algorithm to aggressively reduce the on-chip block RAM (BRAM) and off-chip DRAM utilisation of stencil codes running on FPGAs. The algorithm extracts memory accesses from computational pipelines and removes all redundant intermediate arrays, including those used for stencil buffering, by trading DRAM accesses for computation. It is based on rewrite rules applied to a strict functional representation derived from Fortran code, and it generates provably correct, optimized code. Typical FPGA implementations store the stencil window in on-chip shift registers implemented in BRAMs; we use only DRAM and optimize the memory accesses instead. Our approach dramatically reduces BRAM usage, so that the domain size is limited only by the available DRAM. We report drops of 78% and 18% in BRAM usage for 3-D and 2-D stencil codes, respectively, compared to a manual implementation using shift registers, while staying competitive in performance or even improving performance-per-Watt.
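
To make the idea of removing an intermediate array concrete, the following is a minimal sketch in Python. It is not the paper's algorithm, which operates through rewrite rules on a functional representation derived from Fortran. The function names (smooth, two_stage_buffered, two_stage_fused) and the 3-point averaging stencil are assumptions chosen only for illustration. The sketch chains two 1-D stencil stages, once with an explicit intermediate array and once fused, so that each output re-reads the input and recomputes the stage-one values it needs.

```python
def smooth(u, i):
    """3-point 1-D stencil: average of a cell and its two neighbours."""
    return (u[i - 1] + u[i] + u[i + 1]) / 3.0


def two_stage_buffered(u):
    """Two chained stencil stages with an explicit intermediate array."""
    n = len(u)
    tmp = [0.0] * n  # intermediate array: extra memory proportional to n
    for i in range(1, n - 1):
        tmp[i] = smooth(u, i)
    out = [0.0] * n
    for i in range(2, n - 2):
        out[i] = smooth(tmp, i)
    return out


def two_stage_fused(u):
    """Same result without the intermediate array (illustrative only).

    Each output re-reads u[i-2 .. i+2] and recomputes the three
    stage-one values it needs, trading extra reads and arithmetic
    for the memory that tmp would have occupied.
    """
    n = len(u)
    out = [0.0] * n
    for i in range(2, n - 2):
        a = smooth(u, i - 1)
        b = smooth(u, i)
        c = smooth(u, i + 1)
        out[i] = (a + b + c) / 3.0
    return out


if __name__ == "__main__":
    u = [float(x * x) for x in range(10)]
    # Both variants perform the same arithmetic in the same order.
    print(two_stage_buffered(u) == two_stage_fused(u))
```

The fused variant needs no storage beyond the input and output, at the cost of reading each input element several times. On an FPGA, those repeated reads would correspond to the extra DRAM accesses that the abstract says are then optimized.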
