Yuze Chi, Licheng Guo, Jason Lau, Young-Kyu Choi, Jie Wang, Jason Cong
{"title":"Extending High-Level Synthesis for Task-Parallel Programs.","authors":"Yuze Chi, Licheng Guo, Jason Lau, Young-Kyu Choi, Jie Wang, Jason Cong","doi":"10.1109/fccm51124.2021.00032","DOIUrl":"https://doi.org/10.1109/fccm51124.2021.00032","url":null,"abstract":"<p><p>C/C++/OpenCL-based high-level synthesis (HLS) becomes more and more popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools do support task-parallel programs, the productivity is greatly limited ① in the code development cycle due to the poor programmability, ② in the correctness verification cycle due to restricted software simulation, and ③ in the QoR tuning cycle due to slow code generation. Such limited productivity often defeats the purpose of HLS and hinder programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, unconstrained software simulation, and fast hierarchical code generation to overcome these limitations and demonstrate how task-parallel programs can be productively supported in HLS. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves the programmability. The correctness verification and the iterative QoR tuning cycles are both greatly shortened by 3.2× and 6.8×, respectively. Our work is open-source at https://github.com/UCLA-VAST/tapa/.</p>","PeriodicalId":93352,"journal":{"name":"Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium)","volume":"2021 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/fccm51124.2021.00032","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39396430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MixFX-SCORE: Heterogeneous Fixed-Point Compilation of Dataflow Computations","authors":"João Paiva, L. Rodrigues","doi":"10.1109/.62","DOIUrl":"https://doi.org/10.1109/.62","url":null,"abstract":"Mixed-precision implementation of computation can deliver area, throughput and power improvements for dataflow computations over homogeneous fixed-precision circuits without any loss in accuracy. When designing circuits for reconfigurable hardware, we can exercise independent control over bitwidth selection of each variable in the computation. However, selecting the best precision for each variable is an NP-hard problem. While traditional solutions use automated heuristics like simulated annealing or integer linear programming, they still rely on the manual formulation of resource models, which can be tedious, and potentially inaccurate due to the unpredictable interactions between different stages of the FPGA CAD flow. We develop MixFX-SCORE, an automated tool-flow based on FX-SCORE fixed-point compilation framework and simulated annealing, to address this challenge. We outsource error analysis (Gappa++) and resource model generation (Vivado HLS, Logic Synthesis, Xilinx Place-and-Route) to external tools that offer a more accurate representation of error behavior (backed by proofs) and resource usage (based on actual utilization). We demonstrate 1.1 -- 3.5x LUTs count savings, 1 -- 1.8x DSP count reductions, and 1 -- 3.9x dynamic power improvements while still satisfying the accuracy constraints when compared to homogeneous fixed-point implementations.","PeriodicalId":93352,"journal":{"name":"Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium)","volume":"248 1","pages":"206-209"},"PeriodicalIF":0.0,"publicationDate":"2013-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91395544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. C. G. N. Ewo, A. Pinna, B. Granado, M. Mbouenda, H. Fotsin
{"title":"A Hardware MPI Spawn for Distributed Multiprocessing Reconfigurable System on Chip (MP-RSoC)","authors":"R. C. G. N. Ewo, A. Pinna, B. Granado, M. Mbouenda, H. Fotsin","doi":"10.1109/FCCM.2014.73","DOIUrl":"https://doi.org/10.1109/FCCM.2014.73","url":null,"abstract":"In this paper we describe a hardware implementation of the MPI Spawn function of MPI-2 Remote Memory Access (RMA) communication library primitive, devoted to a distributed Multi Processing Reconfigurable System on Chip (MP-RSoC). This function enhances the MPI Hardware Communication Library (MPI-HCL) we realized in previous work. Designers can activate or deactivate hardware tasks on runtime, using MPI functions in MPRSoC environment. The advantages are more scalability, efficient power consumption and easier deployment of the parallel application. Our hardware primitives have been implemented and tested on a Xilinx Spartan6 FPGA board.","PeriodicalId":93352,"journal":{"name":"Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium)","volume":"7 6","pages":"238"},"PeriodicalIF":0.0,"publicationDate":"2013-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72580125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Gurumani, Jacob Tolar, Yao Chen, Yun Liang, K. Rupnow, Deming Chen
{"title":"Integrated CUDA-to-FPGA Synthesis with Network-on-Chip","authors":"S. Gurumani, Jacob Tolar, Yao Chen, Yun Liang, K. Rupnow, Deming Chen","doi":"10.1109/.12","DOIUrl":"https://doi.org/10.1109/.12","url":null,"abstract":"Data parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can be a significant factor in system performance. However, although each core produces outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves the average memory access latency, allowing the system to improve performance at the same number of cores. In this paper, we develop a network-on-chip coupled with computation cores synthesized from CUDA for FPGAs that enables on-chip data sharing. We demonstrate reduced external bandwidth demand by up to 60% (average 56%) and total application latency in cycles by up to 43% (average 27%).","PeriodicalId":93352,"journal":{"name":"Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium)","volume":"83 1","pages":"21-24"},"PeriodicalIF":0.0,"publicationDate":"2009-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89411354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tian Hangpei, Gao De-yuan, Wei Wu, Fan Xiao-ya, Zhu Yian
{"title":"Improving Performance of Partial Reconfiguration Using Strategy of Virtual Deletion","authors":"Tian Hangpei, Gao De-yuan, Wei Wu, Fan Xiao-ya, Zhu Yian","doi":"10.1109/FCCM.2008.51","DOIUrl":"https://doi.org/10.1109/FCCM.2008.51","url":null,"abstract":"In a partially reconfigurable system with online placement algorithm, we try to avoid mapping some redundant tasks by caching modules on the reconfigurable area. This paper proposes an elaborate strategy named virtual deletion and a low cost board- level hardware named recycle cache to accomplish the goal. In our strategy, the record of corresponding module is deleted from placer and indexed in the recycle cache. If the module might be used by following tasks, it can be restored from reconfigurable area by recycle cache immediately, without mapping the module again. Recycle cache can shorten average configuring time of partial reconfiguration without increasing arithmetic complex and placing time of the placer. Compared with large size of local register file which cache context of modules, the recycle cycle is much smaller and cheaper. Simulation results on large random tasks sets have shown that the recycle cache can improve performance of partially reconfigurable system effectively.","PeriodicalId":93352,"journal":{"name":"Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium)","volume":"61 1","pages":"263-264"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73893248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Relocatability of Partial Bitstreams for Run-Time Reconfiguration","authors":"Tobias Becker, Wayne Luk, Peter Y. K. Cheung","doi":"10.1109/FCCM.2007.51","DOIUrl":"https://doi.org/10.1109/FCCM.2007.51","url":null,"abstract":"In the present paper the background generation and motion detection algorithms, which are of key importance for the implementation of video detection, have been presented. A modification of the background generation algorithm, essential for proper algorithm functioning at medium and high road-traffic conditions, has been proposed. Algorithm adaptation for the implementation in reprogrammable device has been also presented. PixelStream-based implementation has been successfully performed. Real-time verification on reconfigurable platform has been done.","PeriodicalId":93352,"journal":{"name":"Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium)","volume":"16 1","pages":"35-44"},"PeriodicalIF":0.0,"publicationDate":"2007-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84419060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA implementation of pipelined multiplicative division with IEEE Rounding","authors":"Ronen Goldberg, Guy Even, P. Seidel","doi":"10.1109/FCCM.2007.59","DOIUrl":"https://doi.org/10.1109/FCCM.2007.59","url":null,"abstract":"A formal methodology for automatic hardware-software partitioning and co-scheduling between the P and the FPGA has not yet been established. Current work in automatic task partitioning and scheduling for the reconfigurable systems strictly addresses the FPGA hardware, and does not take advantage of the synergy between the microprocessor and the FPGA. In this work, we consider the problem of co-scheduling task graphs on reconfigurable systems. The target systems have an execution model which allows any subtask that can run on the FPGA to also run on the microprocessor, and allows reconfigurability of the FPGA (subject to area, performance, resource, and timing constraints). In this paper, we introduce a new heuristic algorithm for such hardware/software co-scheduling, ReCoS. It will be shown that the proposed algorithm provides up to an order of magnitude improvement in scheduling and execution times when compared with hardware/software co-schedulers found in the embedded systems area, after adapting them for reconfigurable computing.","PeriodicalId":93352,"journal":{"name":"Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium)","volume":"27 1","pages":"185-196"},"PeriodicalIF":0.0,"publicationDate":"2007-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83804282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}