{"title":"从递归抽象中设计吞吐量优化数组","authors":"A. Jacob, J. Buhler, R. Chamberlain","doi":"10.1109/ASAP.2010.5540753","DOIUrl":null,"url":null,"abstract":"Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Design of throughput-optimized arrays from recurrence abstractions\",\"authors\":\"A. Jacob, J. Buhler, R. Chamberlain\",\"doi\":\"10.1109/ASAP.2010.5540753\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. 
We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.\",\"PeriodicalId\":175846,\"journal\":{\"name\":\"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors\",\"volume\":\"40 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-07-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASAP.2010.5540753\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASAP.2010.5540753","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Design of throughput-optimized arrays from recurrence abstractions
Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.
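The abstract's central claim is that an array's throughput equals the inverse of the maximum number of iteration vectors executed by any single processor, and that this count is fixed by the projection vector alone. The sketch below is a minimal, hypothetical illustration of that counting argument on a toy Nussinov-style iteration domain with axis-aligned unit projection vectors; it is not the authors' polyhedral design-space exploration tool, and the domain bounds and projection choices are assumptions made here for demonstration.

```python
# Hypothetical sketch (not the paper's tool): for a small Nussinov-style
# iteration domain {1 <= i < j <= n, i <= k < j}, map each iteration vector
# to a processor by dropping the coordinate along an axis-aligned projection
# vector, count iteration vectors per processor, and report the throughput
# bound 1 / max_count claimed in the abstract.
from collections import Counter


def processor_counts(n, projection):
    """Count iteration vectors per processor under a unit projection vector."""
    axis = projection.index(1)  # assumes an axis-aligned unit projection
    counts = Counter()
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            for k in range(i, j):
                point = (i, j, k)
                # Processor coordinates: the point with the projected axis removed.
                proc = tuple(c for d, c in enumerate(point) if d != axis)
                counts[proc] += 1
    return counts


if __name__ == "__main__":
    n = 16  # illustrative problem size
    for projection in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
        counts = processor_counts(n, projection)
        worst = max(counts.values())
        print(f"projection {projection}: {len(counts)} processors, "
              f"max {worst} iteration vectors per PE, throughput bound 1/{worst}")
```

Under these assumptions the script enumerates the iteration domain directly; a polyhedral tool would instead derive the per-processor count symbolically from the domain and the projection vector, which is what makes design-space exploration over candidate projections tractable.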