MATCHUP: Memory Abstractions for Heap Manipulating Programs

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub date: 2015-02-22. DOI: 10.1145/2684746.2689073

F. Winterstein, Kermin Fleming, Hsin-Jung Yang, Samuel Bayliss, G. Constantinides


Citations: 36

Abstract

Memory-intensive implementations often require access to external, off-chip memory, which can substantially slow down an FPGA accelerator because of memory bandwidth limitations. Buffering frequently reused data on chip is a common way to address this problem, but optimizing the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that generates parallel, application-specific multi-scratchpad architectures, including on-chip caches. Our program analysis identifies non-overlapping memory regions, which are supported by private scratchpads, and regions that are shared by parallel units after parallelization, which are supported by coherent scratchpads and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this work is its focus on programs that use dynamic, pointer-based data structures and dynamic memory allocation, which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. We demonstrate our technique with three case studies of applications that use dynamically allocated data structures, with Xilinx Vivado HLS as an exemplary HLS tool. We show up to 10x speed-up after parallelizing the HLS implementations and inserting the application-specific distributed hybrid scratchpad architecture.
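
The distinction the abstract draws between private and shared memory regions can be illustrated with a toy sketch. This is not the paper's program analysis, which works on pointer-based C code; it only shows the classification step, assuming the set of heap regions touched by each parallel unit is already known. The function name `classify_regions`, the unit names, and the region names are all hypothetical.

```python
from collections import defaultdict


def classify_regions(accesses):
    """Label each heap region as 'private' or 'shared'.

    accesses: dict mapping a parallel-unit name to the set of region
    names that unit touches (hypothetical input, written by hand here;
    a real tool would derive it from the program).
    """
    users = defaultdict(set)
    for unit, regions in accesses.items():
        for region in regions:
            users[region].add(unit)
    # A region touched by exactly one unit could live in a private
    # scratchpad; one touched by several units would need a coherent
    # scratchpad and synchronization.
    return {
        region: "private" if len(units) == 1 else "shared"
        for region, units in users.items()
    }


# Two hypothetical parallel units working on a tree: each owns one
# subtree, and both touch the root node.
accesses = {
    "unit0": {"left_subtree", "root"},
    "unit1": {"right_subtree", "root"},
}
plan = classify_regions(accesses)
```

In this example the two subtrees come out as private and the root as shared, which mirrors the split between private scratchpads and coherent scratchpads described in the abstract.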

