Boosting Memory Performance of Many-Core FPGA Device through Dynamic Precedence Graph

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines. Pub Date: 2013-04-28. DOI: 10.1109/FCCM.2013.39

Yunru Bai, Abigail Fuentes-Rivera, Mike Riera, Mohammed Alawad, Mingjie Lin


Citations: 2

Abstract

Emerging FPGA devices, which integrate abundant RAM blocks and high-performance processor cores, offer an unprecedented opportunity to implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at the algorithmic level. However, implementing a DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer solutions to both challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm, which explores a large solution space to obtain partitionings that promote parallel memory accesses; 2) we augment the proposed DLM architecture with a reconfigurable hardware graph that dynamically computes precedence relationships between memory partitions, thus exploiting algorithmic-level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesis methods: an algorithm-centric reconfigurable computing architecture with a single monolithic memory (A1) and the heterogeneous distributed architecture synthesized according to [1] (A2). To make the comparison fair, the data path remains the same in all three architectures while the local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to dynamic graph-based memory arbitration. All experiments were conducted on a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that the proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. More than 70% of the performance improvement of A3 over A2 comes from the hardware graph-based memory scheduling.
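
The abstract does not give the details of the partitioning scheme, so the following is only a minimal sketch of how simulated annealing could be applied to memory partitioning, not the authors' implementation. It assumes a hypothetical cost model in which arrays accessed in the same cycle should sit in different memory banks; the names `conflict_cost`, `anneal_partition`, `access_groups`, and all parameter values are illustrative assumptions.

```python
import math
import random


def conflict_cost(assignment, access_groups):
    """Number of accesses that would be serialized because another access
    in the same group (same cycle) targets the same bank (assumed model)."""
    cost = 0
    for group in access_groups:
        banks = [assignment[a] for a in group]
        cost += len(banks) - len(set(banks))
    return cost


def anneal_partition(arrays, access_groups, num_banks,
                     steps=5000, t0=2.0, alpha=0.999, seed=0):
    """Assign each array to a bank using simulated annealing.

    Returns (best_assignment, best_cost). The temperature schedule
    (t0, alpha) is an arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    current = {a: rng.randrange(num_banks) for a in arrays}
    cur_cost = conflict_cost(current, access_groups)
    best, best_cost = dict(current), cur_cost
    t = t0
    for _ in range(steps):
        if best_cost == 0:
            break  # no conflicts left, nothing to improve
        # Propose a move: reassign one randomly chosen array to a random bank.
        a = rng.choice(arrays)
        old_bank = current[a]
        current[a] = rng.randrange(num_banks)
        new_cost = conflict_cost(current, access_groups)
        delta = new_cost - cur_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t) so the search can escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = dict(current), cur_cost
        else:
            current[a] = old_bank  # reject the move
        t *= alpha  # geometric cooling
    return best, best_cost


if __name__ == "__main__":
    arrays = ["A", "B", "C", "D"]
    # Each tuple lists arrays that are accessed in the same cycle.
    groups = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
    assignment, cost = anneal_partition(arrays, groups, num_banks=2)
    print(assignment, cost)
```

Accepting occasional cost-increasing moves while the temperature is high is what lets this kind of search explore a large solution space before settling; a real partitioner would also have to model bank capacity and port counts, which this sketch omits.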
