Yunru Bai, Abigail Fuentes-Rivera, Mike Riera, Mohammed Alawad, Mingjie Lin
{"title":"利用动态优先图提高多核FPGA器件的内存性能","authors":"Yunru Bai, Abigail Fuentes-Rivera, Mike Riera, Mohammed Alawad, Mingjie Lin","doi":"10.1109/FCCM.2013.39","DOIUrl":null,"url":null,"abstract":"Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Boosting Memory Performance of Many-Core FPGA Device through Dynamic Precedence Graph\",\"authors\":\"Yunru Bai, Abigail Fuentes-Rivera, Mike Riera, Mohammed Alawad, Mingjie Lin\",\"doi\":\"10.1109/FCCM.2013.39\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.\",\"PeriodicalId\":269887,\"journal\":{\"name\":\"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-04-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FCCM.2013.39\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2013.39","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Boosting Memory Performance of Many-Core FPGA Device through Dynamic Precedence Graph
Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.