Combining computation and communication optimizations in system synthesis for streaming applications

Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub date: 2014-02-26. DOI: 10.1145/2554688.2554771

J. Cong, Muhuan Huang, Peng Zhang


Citations: 34

Abstract

Data streaming is a widely used technique for exploiting task-level parallelism in application domains such as video processing, signal processing, and wireless communication. In this paper we propose an efficient system-level synthesis flow that maps streaming applications onto FPGAs while optimizing computation and communication simultaneously. The throughput of a streaming system is significantly affected not only by the performance and number of replicas of the computation kernels, but also by the buffer sizes allocated for communication between kernels. Previous work has generally addressed module selection/replication and buffer size optimization separately. Our approach combines these optimizations in system scheduling, minimizing the area cost of both logic and memory under the required throughput constraint. We first propose an integer linear program (ILP) based solution to the combined problem, which yields optimal quality of results. We then propose an iterative algorithm that achieves near-optimal quality of results while scaling significantly better to large and complex designs. The key contributions are a polynomial-time algorithm for an exact schedulability checking problem and a polynomial-time algorithm that improves system performance through better module implementation and buffer size optimization. Experimental results show that, compared to performing module selection/replication and buffer size optimization separately, the combined optimization scheme saves 62% area on average under the same performance requirements. Moreover, our heuristic achieves a runtime speed-up of 2 to 3 orders of magnitude, with less than 10% area overhead compared to the optimal ILP solution.
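To make the module selection/replication trade-off concrete, here is a minimal sketch in Python. It is not the paper's ILP or its iterative algorithm, and it ignores buffer sizing entirely. It assumes a simple linear pipeline in which each kernel's throughput is its replica count times a per-replica rate, so each kernel can be chosen independently. The names `cheapest_config` and `select_modules`, the kernel names, and all area and rate figures are hypothetical and chosen only for illustration.

```python
import math


def cheapest_config(candidates, target_throughput):
    """Return (total_area, impl_name, replicas) with the least area meeting the target.

    candidates: list of (impl_name, area_per_replica, items_per_cycle) tuples.
    All values are illustrative assumptions, not data from the paper.
    """
    best = None
    for name, area, rate in candidates:
        # Smallest replica count whose combined rate reaches the target.
        replicas = max(1, math.ceil(target_throughput / rate))
        total = area * replicas
        if best is None or total < best[0]:
            best = (total, name, replicas)
    return best


def select_modules(kernels, target_throughput):
    """Pick an implementation and replica count for every kernel in the pipeline."""
    return {
        kernel: cheapest_config(candidates, target_throughput)
        for kernel, candidates in kernels.items()
    }


# Hypothetical two-kernel pipeline with two implementations per kernel.
kernels = {
    "filter": [("small", 100, 0.25), ("fast", 350, 1.0)],
    "fft": [("small", 400, 0.5), ("fast", 900, 1.0)],
}
plan = select_modules(kernels, target_throughput=1.0)
for kernel, (area, impl, replicas) in plan.items():
    print(f"{kernel}: {replicas} x {impl} (area {area})")
```

In this toy model one fast `filter` replica beats four small ones, while two small `fft` replicas beat one fast one. The independence between kernels disappears once buffer sizes between kernels also affect throughput, which is the coupling the abstract says the combined optimization addresses.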


