Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium): Latest Articles

Extending High-Level Synthesis for Task-Parallel Programs.
Pub Date: 2021-05-01; Epub Date: 2021-06-02; DOI: 10.1109/fccm51124.2021.00032
Yuze Chi, Licheng Guo, Jason Lau, Young-Kyu Choi, Jie Wang, Jason Cong
Abstract: C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools do support task-parallel programs, productivity is greatly limited ① in the code development cycle due to poor programmability, ② in the correctness verification cycle due to restricted software simulation, and ③ in the QoR tuning cycle due to slow code generation. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, unconstrained software simulation, and fast hierarchical code generation to overcome these limitations and demonstrate how task-parallel programs can be productively supported in HLS. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves programmability. The correctness verification and the iterative QoR tuning cycles are shortened by 3.2× and 6.8×, respectively. Our work is open-source at https://github.com/UCLA-VAST/tapa/.
Citations: 32
MixFX-SCORE: Heterogeneous Fixed-Point Compilation of Dataflow Computations
João Paiva, L. Rodrigues
Pub Date: 2013-12-15; DOI: 10.1109/.62; Volume: 248 1; Pages: 206-209
Abstract: Mixed-precision implementation of computation can deliver area, throughput and power improvements for dataflow computations over homogeneous fixed-precision circuits without any loss in accuracy. When designing circuits for reconfigurable hardware, we can exercise independent control over bitwidth selection of each variable in the computation. However, selecting the best precision for each variable is an NP-hard problem. While traditional solutions use automated heuristics like simulated annealing or integer linear programming, they still rely on the manual formulation of resource models, which can be tedious and potentially inaccurate due to the unpredictable interactions between different stages of the FPGA CAD flow. We develop MixFX-SCORE, an automated tool-flow based on the FX-SCORE fixed-point compilation framework and simulated annealing, to address this challenge. We outsource error analysis (Gappa++) and resource model generation (Vivado HLS, Logic Synthesis, Xilinx Place-and-Route) to external tools that offer a more accurate representation of error behavior (backed by proofs) and resource usage (based on actual utilization). We demonstrate 1.1–3.5× LUT count savings, 1–1.8× DSP count reductions, and 1–3.9× dynamic power improvements while still satisfying the accuracy constraints when compared to homogeneous fixed-point implementations.
Citations: 7
A Hardware MPI Spawn for Distributed Multiprocessing Reconfigurable System on Chip (MP-RSoC)
R. C. G. N. Ewo, A. Pinna, B. Granado, M. Mbouenda, H. Fotsin
Pub Date: 2013-12-15; DOI: 10.1109/FCCM.2014.73; Volume: 7 6; Pages: 238
Abstract: In this paper, we describe a hardware implementation of the MPI Spawn function from the MPI-2 Remote Memory Access (RMA) communication library primitives, intended for a distributed Multi Processing Reconfigurable System on Chip (MP-RSoC). This function enhances the MPI Hardware Communication Library (MPI-HCL) we developed in previous work. Designers can activate or deactivate hardware tasks at runtime, using MPI functions in the MP-RSoC environment. The advantages are greater scalability, more efficient power consumption, and easier deployment of the parallel application. Our hardware primitives have been implemented and tested on a Xilinx Spartan-6 FPGA board.
Citations: 2
Integrated CUDA-to-FPGA Synthesis with Network-on-Chip
S. Gurumani, Jacob Tolar, Yao Chen, Yun Liang, K. Rupnow, Deming Chen
Pub Date: 2009-07-20; DOI: 10.1109/.12; Volume: 83 1; Pages: 21-24
Abstract: Data parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can be a significant factor in system performance. However, although each core produces outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves the average memory access latency, allowing the system to improve performance at the same number of cores. In this paper, we develop a network-on-chip coupled with computation cores synthesized from CUDA for FPGAs that enables on-chip data sharing. We demonstrate reductions in external bandwidth demand of up to 60% (average 56%) and in total application latency in cycles of up to 43% (average 27%).
Citations: 0
Improving Performance of Partial Reconfiguration Using Strategy of Virtual Deletion
Tian Hangpei, Gao De-yuan, Wei Wu, Fan Xiao-ya, Zhu Yian
Pub Date: 2008-04-14; DOI: 10.1109/FCCM.2008.51; Volume: 61 1; Pages: 263-264
Abstract: In a partially reconfigurable system with an online placement algorithm, we try to avoid redundant task mapping by caching modules on the reconfigurable area. This paper proposes a strategy named virtual deletion and a low-cost board-level hardware component named the recycle cache to accomplish this goal. In our strategy, the record of the corresponding module is deleted from the placer and indexed in the recycle cache. If the module is needed by a subsequent task, the recycle cache can restore it from the reconfigurable area immediately, without mapping the module again. The recycle cache can shorten the average configuration time of partial reconfiguration without increasing the arithmetic complexity or the placement time of the placer. Compared with a large local register file that caches the context of modules, the recycle cache is much smaller and cheaper. Simulation results on large random task sets show that the recycle cache can effectively improve the performance of partially reconfigurable systems.
Citations: 1
Enhancing Relocatability of Partial Bitstreams for Run-Time Reconfiguration
Tobias Becker, Wayne Luk, Peter Y. K. Cheung
Pub Date: 2007-04-23; DOI: 10.1109/FCCM.2007.51; Volume: 16 1; Pages: 35-44
Abstract: In the present paper, background generation and motion detection algorithms, which are of key importance for the implementation of video detection, are presented. A modification of the background generation algorithm, essential for proper functioning under medium and high road-traffic conditions, is proposed. Adaptation of the algorithm for implementation in a reprogrammable device is also presented. A PixelStream-based implementation has been completed successfully, and real-time verification on a reconfigurable platform has been carried out.
Citations: 2
An FPGA implementation of pipelined multiplicative division with IEEE Rounding
Ronen Goldberg, Guy Even, P. Seidel
Pub Date: 2007-04-23; DOI: 10.1109/FCCM.2007.59; Volume: 27 1; Pages: 185-196
Abstract: A formal methodology for automatic hardware-software partitioning and co-scheduling between the μP and the FPGA has not yet been established. Current work in automatic task partitioning and scheduling for reconfigurable systems strictly addresses the FPGA hardware, and does not take advantage of the synergy between the microprocessor and the FPGA. In this work, we consider the problem of co-scheduling task graphs on reconfigurable systems. The target systems have an execution model which allows any subtask that can run on the FPGA to also run on the microprocessor, and allows reconfigurability of the FPGA (subject to area, performance, resource, and timing constraints). In this paper, we introduce a new heuristic algorithm for such hardware/software co-scheduling, ReCoS. It will be shown that the proposed algorithm provides up to an order of magnitude improvement in scheduling and execution times when compared with hardware/software co-schedulers found in the embedded systems area, after adapting them for reconfigurable computing.
Citations: 1