Using buffer-to-BRAM mapping approaches to trade-off throughput vs. memory use

2014 24th International Conference on Field Programmable Logic and Applications (FPL). Pub Date: 2014-10-20. DOI: 10.1109/FPL.2014.6927469

Jasmina Vasiljevic, P. Chow


Citations: 6

Abstract

One of the challenges in designing high-performance FPGA applications is fine-tuning the allocation of limited on-chip memory among an application's many buffers. To achieve the desired performance and stay within the on-chip memory budget, the designer must manually assign application buffers to physical on-chip memories. Mismatches between the dimensions (bit width and depth) of buffers and those of physical on-chip memories leave memories underutilized. Memory utilization can be increased via buffer packing, that is, grouping buffers together and implementing them as a single memory, at the expense of data throughput. However, identifying the buffer groups that require the least physical memory is a combinatorial problem with a large search space. The process is time-consuming and non-trivial, particularly with a large number of buffers of various depths and bit widths. Previous work [1] introduced a tool that provides high-level pragmas allowing the user to specify global memory requirements, such as an application's on-chip memory budget and data throughput. This paper extends that work by introducing two low-level pragmas that specify information about memory access patterns, improving on-chip memory utilization by up to 22%. Further, we develop a simulated-annealing-based buffer packing algorithm that reduces the tool's run-time from over 30 minutes to 15 seconds while also improving the performance of the generated memory solution. Finally, we demonstrate the effectiveness of our tool with four stream application benchmarks.
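
To make the packing trade-off concrete, the following Python sketch estimates how many block RAMs a set of buffers needs when implemented separately versus packed, and then searches for a grouping with simulated annealing. It is an illustration only, not the authors' tool or algorithm. The aspect ratios in `BRAM_CONFIGS` are an assumption modeled on a typical 18 Kb block RAM; the depth-wise stacking cost model, the `max_group_size` throughput limit, the example buffer sizes, and the annealing schedule are all hypothetical choices made for this example.

```python
import math
import random

# ASSUMPTION: (depth, width) aspect ratios of one physical BRAM, modeled on a
# typical 18 Kb block RAM. The paper's abstract does not specify these.
BRAM_CONFIGS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]

def brams_needed(depth, width):
    """Fewest BRAMs for a depth x width memory built from a single aspect ratio."""
    return min(math.ceil(depth / d) * math.ceil(width / w) for d, w in BRAM_CONFIGS)

def packed_cost(group):
    """ASSUMED packing model: buffers are stacked depth-wise in one memory,
    so depths add up and the widest buffer sets the memory width."""
    depth = sum(d for d, _ in group)
    width = max(w for _, w in group)
    return brams_needed(depth, width)

def assignment_cost(buffers, assign):
    """Total BRAMs when buffer i is placed in group assign[i]."""
    groups = {}
    for buf, g in zip(buffers, assign):
        groups.setdefault(g, []).append(buf)
    return sum(packed_cost(g) for g in groups.values())

def anneal(buffers, max_group_size, steps=2000, t0=2.0, alpha=0.995, seed=0):
    """Simulated annealing over groupings. max_group_size is a hypothetical
    stand-in for a throughput constraint: k buffers sharing one memory port
    each get roughly 1/k of its bandwidth."""
    rng = random.Random(seed)
    assign = list(range(len(buffers)))  # start: every buffer in its own group
    cur = assignment_cost(buffers, assign)
    best, best_cost = assign[:], cur
    t = t0
    for _ in range(steps):
        i = rng.randrange(len(buffers))   # buffer to move
        g = rng.randrange(len(buffers))   # target group id
        if g != assign[i] and assign.count(g) < max_group_size:
            cand = assign[:]
            cand[i] = g
            c = assignment_cost(buffers, cand)
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if c <= cur or rng.random() < math.exp((cur - c) / t):
                assign, cur = cand, c
                if cur < best_cost:
                    best, best_cost = assign[:], cur
        t *= alpha  # geometric cooling
    return best, best_cost

# Hypothetical buffers as (depth, bit width).
buffers = [(300, 10), (200, 12), (400, 16), (100, 8), (600, 18), (250, 9)]
separate = sum(brams_needed(d, w) for d, w in buffers)
best, best_cost = anneal(buffers, max_group_size=3)
print(f"separate: {separate} BRAMs, packed: {best_cost} BRAMs, grouping: {best}")
```

Under these assumptions, each small buffer occupies a whole BRAM on its own, which is the underutilization the abstract describes. Packing recovers that space, while the group-size cap models the throughput that is given up in exchange.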


