一种用于FPGA加速指针数据结构的遍历缓存框架:以Barnes-Hut n体仿真为例

2009 International Conference on Reconfigurable Computing and FPGAs Pub Date : 2009-12-09 DOI:10.1109/ReConFig.2009.68

J. Coole, J. Wernsing, G. Stitt

{"title":"一种用于FPGA加速指针数据结构的遍历缓存框架:以Barnes-Hut n体仿真为例","authors":"J. Coole, J. Wernsing, G. Stitt","doi":"10.1109/ReConFig.2009.68","DOIUrl":null,"url":null,"abstract":"Numerous studies have shown that field-programmable gate arrays (FPGAs) often achieve large speedups compared to microprocessors. However, one significant limitation of FPGAs that has prevented their use on important applications is the requirement for regular memory access patterns. Traversal caches were previously introduced to improve the performance of FPGA implementations of algorithms with irregular memory access patterns, especially those traversing pointer-based data structures. However, a significant limitation of previous traversal caches is that speedup was limited to traversals repeated frequently over time, thus preventing speedup for algorithms without repetition, even if the similarity between traversals was large. This paper presents a new framework that extends traversal caches to enable performance improvements in such cases and provides additional improvements through reduced memory accesses and parallel processing of multiple traversals. Most importantly, we show that, for algorithms with highly similar traversals, the traversal cache framework achieves approximately linear kernel speedup with additional area, thus eliminating the memory bandwidth bottleneck commonly associated with FPGAs. We evaluate the framework using a Barnes-Hut n-body simulation case study, showing application speedups ranging from 12x to 13.5x on a Virtex4 LX100 with projected speedups as high as 40x on today’s largest FPGAs.","PeriodicalId":325631,"journal":{"name":"2009 International Conference on Reconfigurable Computing and FPGAs","volume":"60 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"A Traversal Cache Framework for FPGA Acceleration of Pointer Data Structures: A Case Study on Barnes-Hut N-body Simulation\",\"authors\":\"J. Coole, J. Wernsing, G. Stitt\",\"doi\":\"10.1109/ReConFig.2009.68\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Numerous studies have shown that field-programmable gate arrays (FPGAs) often achieve large speedups compared to microprocessors. However, one significant limitation of FPGAs that has prevented their use on important applications is the requirement for regular memory access patterns. Traversal caches were previously introduced to improve the performance of FPGA implementations of algorithms with irregular memory access patterns, especially those traversing pointer-based data structures. However, a significant limitation of previous traversal caches is that speedup was limited to traversals repeated frequently over time, thus preventing speedup for algorithms without repetition, even if the similarity between traversals was large. This paper presents a new framework that extends traversal caches to enable performance improvements in such cases and provides additional improvements through reduced memory accesses and parallel processing of multiple traversals. Most importantly, we show that, for algorithms with highly similar traversals, the traversal cache framework achieves approximately linear kernel speedup with additional area, thus eliminating the memory bandwidth bottleneck commonly associated with FPGAs. We evaluate the framework using a Barnes-Hut n-body simulation case study, showing application speedups ranging from 12x to 13.5x on a Virtex4 LX100 with projected speedups as high as 40x on today’s largest FPGAs.\",\"PeriodicalId\":325631,\"journal\":{\"name\":\"2009 International Conference on Reconfigurable Computing and FPGAs\",\"volume\":\"60 3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 International Conference on Reconfigurable Computing and FPGAs\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ReConFig.2009.68\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 International Conference on Reconfigurable Computing and FPGAs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ReConFig.2009.68","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 14

摘要

大量研究表明，与微处理器相比，现场可编程门阵列(fpga)通常具有较大的速度。然而，fpga的一个重要限制阻碍了它们在重要应用程序上的使用，那就是对常规内存访问模式的要求。以前引入遍历缓存是为了提高具有不规则内存访问模式的算法的FPGA实现的性能，特别是那些遍历基于指针的数据结构的算法。然而，以前的遍历缓存的一个重要限制是，加速仅限于随着时间的推移而频繁重复的遍历，从而阻止了没有重复的算法的加速，即使遍历之间的相似性很大。本文提出了一个扩展遍历缓存的新框架，以在这种情况下实现性能改进，并通过减少内存访问和并行处理多次遍历提供额外的改进。最重要的是，我们表明，对于具有高度相似遍历的算法，遍历缓存框架实现了具有额外面积的近似线性内核加速，从而消除了通常与fpga相关的内存带宽瓶颈。我们使用Barnes-Hut n体模拟案例研究来评估该框架，显示在Virtex4 LX100上的应用程序加速范围从12倍到13.5倍，在当今最大的fpga上预计加速高达40倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Traversal Cache Framework for FPGA Acceleration of Pointer Data Structures: A Case Study on Barnes-Hut N-body Simulation

Numerous studies have shown that field-programmable gate arrays (FPGAs) often achieve large speedups compared to microprocessors. However, one significant limitation of FPGAs that has prevented their use on important applications is the requirement for regular memory access patterns. Traversal caches were previously introduced to improve the performance of FPGA implementations of algorithms with irregular memory access patterns, especially those traversing pointer-based data structures. However, a significant limitation of previous traversal caches is that speedup was limited to traversals repeated frequently over time, thus preventing speedup for algorithms without repetition, even if the similarity between traversals was large. This paper presents a new framework that extends traversal caches to enable performance improvements in such cases and provides additional improvements through reduced memory accesses and parallel processing of multiple traversals. Most importantly, we show that, for algorithms with highly similar traversals, the traversal cache framework achieves approximately linear kernel speedup with additional area, thus eliminating the memory bandwidth bottleneck commonly associated with FPGAs. We evaluate the framework using a Barnes-Hut n-body simulation case study, showing application speedups ranging from 12x to 13.5x on a Virtex4 LX100 with projected speedups as high as 40x on today’s largest FPGAs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2009 International Conference on Reconfigurable Computing and FPGAs

自引率

0.00%

发文量