2022 32nd International Conference on Field-Programmable Logic and Applications (FPL): Latest Papers, Page 2

Dynamic Inter-Block Scheduling for HLS

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00045

Jianyi Cheng, Lana Josipović, G. Constantinides, John Wickerson

Abstract: A recent theme in HLS research is the production of dynamically scheduled circuits, which are made up of components that use handshaking to schedule themselves at run time, as opposed to following a schedule determined statically at compile time. Dynamically scheduled circuits promise superior performance on ‘irregular’ source programs, such as those whose control flow depends on input data, at the cost of additional area. Current dynamic scheduling techniques are well able to exploit parallelism among instructions within each basic block (BB) of the source program, but parallelism between BBs is underexplored. Although current tools allow the operations of different BBs to overlap, they require the BBs to start in strict program order, thus limiting the achievable parallelism and overall performance. We seek to lift this restriction. Doing so involves developing a toolflow that tackles the following challenges: (1) finding consecutive subgraphs in the control-flow graph and using static analysis to identify those subgraphs that can be safely parallelised, and (2) adapting the circuit so that those subgraphs are executed in parallel while ensuring deterministic circuit behaviour and correct usage of memory interfaces. Using two benchmark sets from related works, we compare our proposed toolflow against a state-of-the-art dynamically scheduled HLS tool called Dynamatic. Our results show that after standard loop unrolling is applied, our toolflow yields a 4× average speedup, with a negligible area overhead. This increases to a 7.3× average speedup when our toolflow is further combined with C-slow pipelining.

Citations: 5

Reduction of Bitstream Size for Low-Cost iCE40 FPGAs

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00028

Clemens Fritzsch, Jörn Hoffmann, Martin Bogdan

Abstract: Reducing the bitstream size is important to lower external storage requirements and to speed up the reconfiguration of field-programmable gate arrays (FPGAs). The most common methods for bitstream size reduction are based on dedicated hardware elements or dynamic partial reconfiguration. All of these properties are usually missing in low-cost FPGAs such as the Lattice iCE40 device family. In this paper we propose a lightweight compaction approach for iCE40 FPGAs. We present five methods for bitstream compaction: two adapted and three new. The methods work directly on the bitstream by removing unnecessary data and redundant commands. They are applicable independent of the synthesis toolchain and require neither repetition of synthesis steps nor modifications of the target system. Although our focus is on iCE40 devices, we additionally discuss the conditions for applying our approach to other targets. All five methods were implemented in an open-source compaction tool. We evaluate our approach with an iCE40 HX8K FPGA by synthesizing and compacting various projects. As a result, we achieve a reduction in bitstream size and reconfiguration time by up to 79%.

Citations: 0

Optimal Binding and Port Assignment for Loop Pipelining in High-Level Synthesis

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00047

Nicolai Fiege, Patrick Sittel, P. Zipf

Abstract: In order to provide high throughput for custom hardware implementations, academic and commercial high-level synthesis (HLS) tools use loop pipelining by modulo scheduling. When provided a resource allocation and a schedule, the binding algorithm can be used to reduce the number of required lifetime registers (LR) and multiplexers (MUX). Contrary to non-modulo schedules, optimal solutions to the binding problem for implementing modulo schedules with respect to minimizing required LRs and MUXs have not been published. To address this topic, we propose a novel optimal binding algorithm to simultaneously minimize MUX and LR costs for loop pipelining using Integer Linear Programming. We evaluated our algorithm on a set of commonly used benchmark instances from digital signal processing and report that all encountered problems could be solved, with 36.53% of the solutions being optimal within a time limit of only five minutes. Compared to worst case evaluations, we report MUX and LR savings of up to 42.74% and 26.62%, respectively. To evaluate the impact on the resulting circuit after place and route, we studied FPGA implementations of several benchmark instances and recorded look-up table and flip-flop reductions of up to 13.70% and 5.24%, respectively, compared to previous work and to an extensive set of randomly generated bindings when state-of-the-art algorithms fail to find a feasible solution.

Citations: 2

FPL Demo: Hot Reconfiguration - Partial Reconfiguration without Bounds

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00084

Myrtle Shah

Citations: 0

Modeling and Exploration of Elastic CGRAs

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00067

Omar Ragheb, Tianyi Yu, David Ma, J. Anderson

Abstract: Elastic design concepts have the potential to bring multiple benefits to coarse-grained reconfigurable array (CGRA) architectures, including the ability to interface with memories having unknown latencies, incorporate run-time variable-latency processing elements, and ease the CGRA mapping challenges of scheduling, placement and routing. However, there are overheads in terms of power, performance and area (PPA) associated with the design and implementation of elastic circuits. In this paper, we quantify these overheads in the CGRA context by first extending an open-source CGRA modelling and exploration framework (CGRA-ME) [4] to allow elastic circuit primitives (e.g. fork, join, merge, diverge, etc.) to be used when composing/modelling a CGRA architecture. We then use this new capability to “elasticize” two widely studied CGRA architectures, ADRES [11] and HyCUBE [8]. The PPA of the elastic versions of the CGRAs is compared with that of their traditional statically scheduled counterparts. We also evaluate the PPA “cost” of several elastic-circuit design points, such as elastic buffer length and inclusion of merge and diverge components.

Citations: 2

A-U3D: A Unified 2D/3D CNN Accelerator on the Versal Platform for Disparity Estimation

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00029

Tianyu Zhang, Dong Li, Hong Wang, Yunzhi Li, Xiang Ma, Wei Luo, Yu Wang, Yang Huang, Yi Li, Yu Zhang, Xinlin Yang, Xijie Jia, Qiang Lin, Lu Tian, Fan Jiang, Dongliang Xie, Hong Luo, Yi Shan

Abstract: 3-Dimensional (3D) convolutional neural networks (CNN) are widely used in the field of disparity estimation. However, 3D CNN is more computationally dense than 2D CNN due to the increase in the disparity dimension. To enable more practical applications in autonomous driving, robotics, and other scenarios on embedded devices, we propose a unified 2D/3D CNN accelerator (A-U3D) design. This design unifies 3D standard / transposed convolution into 2D standard convolution, respectively. Our processing unit can support 2D and 3D convolution in the same mode without additional structures. Based on PSMNet, a 3D-based CNN for disparity estimation, we build a heterogeneous multi-core system integrated with A-U3D in conjunction with CPU, DSP, and AI Engines on the Xilinx Versal ACAP platform. Running the pruned 8-bit model, our A-U3D system achieves 0.289 s latency, which is 11.5× faster than the state-of-the-art solution on the same platform, and reaches an end-to-end (E2E) performance of 10.1 frames per second (FPS). Our proposed system explores the feasibility of deploying 3D CNNs with large workloads on FPGA.

Citations: 2
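
The A-U3D abstract above says that 3D standard convolution is unified into 2D standard convolution, but it does not spell out the mapping. The sketch below shows only the textbook identity that makes such a unification plausible: each output depth slice of a 3D convolution equals the sum of 2D convolutions of consecutive input slices with the matching kernel slices. It is not necessarily the authors' scheme, it ignores transposed convolution, channels, stride, and padding, and the function names `conv2d`, `conv3d_via_2d`, and `conv3d_direct` are hypothetical.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def conv3d_via_2d(volume, kernel):
    """3D convolution computed only with conv2d calls plus accumulation."""
    kd = len(kernel)
    out = []
    for d in range(len(volume) - kd + 1):
        acc = None
        for k in range(kd):
            # One 2D convolution per kernel depth slice.
            part = conv2d(volume[d + k], kernel[k])
            if acc is None:
                acc = part
            else:
                acc = [[a + b for a, b in zip(row_a, row_b)]
                       for row_a, row_b in zip(acc, part)]
        out.append(acc)
    return out


def conv3d_direct(volume, kernel):
    """Reference 3D convolution with an explicit triple sum, for comparison."""
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    od = len(volume) - kd + 1
    oh = len(volume[0]) - kh + 1
    ow = len(volume[0][0]) - kw + 1
    return [
        [
            [
                sum(volume[d + a][y + b][x + c] * kernel[a][b][c]
                    for a in range(kd) for b in range(kh) for c in range(kw))
                for x in range(ow)
            ]
            for y in range(oh)
        ]
        for d in range(od)
    ]
```

Because the 3D case reduces to repeated 2D calls plus an accumulator, a single 2D processing unit could in principle serve both workloads, which is the property the abstract highlights.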

Precise Characterizing of FPGAs in Production Systems

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00080

Bardia Babaei, Dirk Koch

Citations: 0

Optimizing Application Mapping for Multi-FPGA Systems with Multi-ejection STDM Switches

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00032

Kohe Ito, Ryota Yasudo, H. Amano

Citations: 0

Increasing Flexibility of Cloud FPGA Virtualization

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00060

Jinjie Ruan, Yisong Chang, Ke Zhang, Kan Shi, Mingyu Chen, Yungang Bao

Abstract: FPGA virtualization enables multiple tenants to share programmable hardware resources for application acceleration in the cloud. However, this technique is still of limited use in commercial FPGA cloud platforms, mainly because of: 1) the absence of direct programming interfaces to the virtualized FPGA accelerators (vFPGAs) in tenants' virtual machines (VMs), 2) a fixed VM-vFPGA data movement scheme that does not adapt to the wide range of data sizes among different applications, and 3) performance degradation due to unregulated inter-vFPGA competition for limited shareable external resources (e.g., off-chip DRAM bandwidth). To tackle all the above issues, we propose a flexible FPGA virtualization framework and prototype an open cloud platform with ARM SoC-equipped FPGAs. Under this framework, tenants are allowed to directly initiate FPGA partial reconfiguration in isolated VMs via a direct I/O-like vFPGA device driver with as low as 20 ms overhead. A hybrid data movement approach that leverages both memory-mapped I/O and DMA is also introduced in our framework to adaptively guarantee moderate VM-vFPGA bandwidth across various data sizes. Moreover, a lightweight priority-based hardware scheduler is elaborated to monitor and dynamically allocate off-chip DRAM bandwidth among vFPGAs. Based on our preliminary infrastructure-level evaluation results, the proposed framework and the open prototype are of significant interest to researchers looking to conduct further explorations in FPGA virtualization.

Citations: 1

BunchBloomer: Cost-Effective Bloom Filter Accelerator for Genomics Applications

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) Pub Date : 2022-08-01 DOI: 10.1109/FPL57034.2022.00014

Seongyoung Kang, Tarun Sai Ganesh Nerella, Shashank Uppoor, S. Jun

Abstract: Bloom filters are a very important tool for many applications including genomics, where they are used as a compact data structure for counting k-mers, representing de Bruijn graphs, and more. Due to their random-access nature coupled with the large size required for genomics, Bloom filters for genomics can easily become bound by the random access performance of off-chip memory. This is especially true for accelerators such as FPGAs and GPUs, which can easily remove the computation overhead of the multiple hash functions. As a result, Bloom filter accelerators have typically focused either on small filters which can fit in fast on-chip memory, or require fast off-chip memory fabric such as Hybrid Memory Cubes. In this work, we present BunchBloomer, which improves the cost-effectiveness of FPGA Bloom filter accelerators by making better use of cheaper, lower-power DDR memory. BunchBloomer uses a multi-layer radix sorter to group table updates into bursts directed to the same 8 KiB memory region, which can be efficiently cached in on-chip memory. A single BunchBloomer device outperforms a costly 12-core server by over 2×, demonstrating an order of magnitude better power efficiency. It even achieves better power efficiency compared to published FPGA Bloom filter accelerators equipped with Hybrid Memory Cubes.

Citations: 1
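
The BunchBloomer abstract above describes grouping Bloom filter updates into bursts that target the same 8 KiB memory region. The following software sketch illustrates that idea only in spirit: it computes all bit positions first, buckets them by region, and then applies each bucket in turn, so writes to one region happen back to back. A single-pass dictionary bucketing stands in for the paper's multi-layer radix sorter, and the BLAKE2b-based hashing, the default of three hash functions, and the names `bit_positions`, `grouped_insert`, and `contains` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

REGION_BITS = 8 * 1024 * 8  # one 8 KiB region, expressed in bits


def bit_positions(item: bytes, num_bits: int, num_hashes: int):
    """Yield num_hashes bit positions for item (hash choice is an assumption)."""
    for i in range(num_hashes):
        digest = hashlib.blake2b(
            item, digest_size=8, salt=i.to_bytes(8, "little")
        ).digest()
        yield int.from_bytes(digest, "little") % num_bits


def grouped_insert(bits: bytearray, items, num_hashes: int = 3) -> int:
    """Insert items region by region; return the number of regions touched."""
    num_bits = len(bits) * 8
    buckets = defaultdict(list)
    for item in items:
        for pos in bit_positions(item, num_bits, num_hashes):
            buckets[pos // REGION_BITS].append(pos)
    # Apply one region at a time, mimicking a burst to a cached block.
    for region in sorted(buckets):
        for pos in buckets[region]:
            bits[pos >> 3] |= 1 << (pos & 7)
    return len(buckets)


def contains(bits: bytearray, item: bytes, num_hashes: int = 3) -> bool:
    """Standard Bloom filter membership test (may return false positives)."""
    num_bits = len(bits) * 8
    return all(
        bits[pos >> 3] & (1 << (pos & 7))
        for pos in bit_positions(item, num_bits, num_hashes)
    )
```

Reordering the updates does not change the final filter contents, because setting bits is commutative; only the memory access pattern changes, which is where the abstract locates the benefit for DDR memory.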