ACM Transactions on Reconfigurable Technology and Systems (TRETS)最新文献

Highly Parallel Multi-FPGA System Compilation from Sequential C/C++ Code in the AWS Cloud 基于顺序C/ c++代码的AWS云中高度并行多fpga系统编译

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-05-15 DOI: 10.1145/3507698

K. Ebcioglu, Ismail San

{"title":"Highly Parallel Multi-FPGA System Compilation from Sequential C/C++ Code in the AWS Cloud","authors":"K. Ebcioglu, Ismail San","doi":"10.1145/3507698","DOIUrl":"https://doi.org/10.1145/3507698","url":null,"abstract":"We present a High Level Synthesis compiler that automatically obtains a multi-chip accelerator system from a single-threaded sequential C/C++ application. Invoking the multi-chip accelerator is functionally identical to invoking the single-threaded sequential code the multi-chip accelerator is compiled from. Therefore, software development for using the multi-chip accelerator hardware is simplified, but the multi-chip accelerator can exhibit extremely high parallelism. We have implemented, tested, and verified our push-button system design model on multiple field-programmable gate arrays (FPGAs) of the Amazon Web Services EC2 F1 instances platform, using, as an example, a sequential-natured DES key search application that does not have any DOALL loops and that tries each candidate key in order and stops as soon as a correct key is found. An 8- FPGA accelerator produced by our compiler achieves 44,600 times better performance than an x86 Xeon CPU executing the sequential single-threaded C program the accelerator was compiled from. New features of our compiler system include: an ability to parallelize outer loops with loop-carried control dependences, an ability to pipeline an outer loop without fully unrolling its inner loops, and fully automated deployment, execution and termination of multi-FPGA application-specific accelerators in the AWS cloud, without requiring any manual steps.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115607442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FPGA Architecture Exploration for DNN Acceleration DNN加速的FPGA架构探索

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-05-10 DOI: 10.1145/3503465

Esther Roorda, Seyedramin Rasoulinezhad, Philip H. W. Leong, S. Wilton

{"title":"FPGA Architecture Exploration for DNN Acceleration","authors":"Esther Roorda, Seyedramin Rasoulinezhad, Philip H. W. Leong, S. Wilton","doi":"10.1145/3503465","DOIUrl":"https://doi.org/10.1145/3503465","url":null,"abstract":"Recent years have seen an explosion of machine learning applications implemented on Field-Programmable Gate Arrays (FPGAs). FPGA vendors and researchers have responded by updating their fabrics to more efficiently implement machine learning accelerators, including innovations such as enhanced Digital Signal Processing (DSP) blocks and hardened systolic arrays. Evaluating architectural proposals is difficult, however, due to the lack of publicly available benchmark circuits. This paper addresses this problem by presenting an open-source benchmark circuit generator that creates realistic DNN-oriented circuits for use in FPGA architecture studies. Unlike previous generators, which create circuits that are agnostic of the underlying FPGA, our circuits explicitly instantiate embedded blocks, allowing for meaningful comparison of recent architectural proposals without the need for a complete inference computer-aided design (CAD) flow. Our circuits are compatible with the VTR CAD suite, allowing for architecture studies that investigate routing congestion and other low-level architectural implications. In addition to addressing the lack of machine learning benchmark circuits, the architecture exploration flow that we propose allows for a more comprehensive evaluation of FPGA architectures than traditional static benchmark suites. We demonstrate this through three case studies which illustrate how realistic benchmark circuits can be generated to target different heterogeneous FPGAs.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128766138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Introduction to Special Issue on FPGAs in Data Centers, Part II 数据中心fpga专题介绍，第二部分

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-05-10 DOI: 10.1145/3495231

Ken Eguro, S. Neuendorffer, V. Prasanna, Hongbo Rong

引用次数: 0

A BNN Accelerator Based on Edge-skip-calculation Strategy and Consolidation Compressed Tree 基于边跳计算策略和压缩树的BNN加速器

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-05-10 DOI: 10.1145/3494569

Gaoming Du, Bangyi Chen, Zhenmin Li, Zhenxing Tu, Junjie Zhou, Shenya Wang, Qinghao Zhao, Yong-Sheng Yin, Xiaolei Wang

{"title":"A BNN Accelerator Based on Edge-skip-calculation Strategy and Consolidation Compressed Tree","authors":"Gaoming Du, Bangyi Chen, Zhenmin Li, Zhenxing Tu, Junjie Zhou, Shenya Wang, Qinghao Zhao, Yong-Sheng Yin, Xiaolei Wang","doi":"10.1145/3494569","DOIUrl":"https://doi.org/10.1145/3494569","url":null,"abstract":"Binarized neural networks (BNNs) and batch normalization (BN) have already become typical techniques in artificial intelligence today. Unfortunately, the massive accumulation and multiplication in BNN models bring challenges to field-programmable gate array (FPGA) implementations, because complex arithmetics in BN consume too much computing resources. To relax FPGA resource limitations and speed up the computing process, we propose a BNN accelerator architecture based on consolidation compressed tree scheme by combining both XNOR and accumulation operation of the low bit into a systematic one. During the compression process, we adopt 0-padding (not ±1) to achieve no-accuracy-loss from software modeling to hardware implementation. Moreover, we introduce shift-addition-BN free binarization technique to shorten the delay path and optimize on-chip storage. To sum up, we drastically cut down the hardware consumption while maintaining great speed performance with the same model complexity as the previous design. We evaluate our accelerator on MNIST and CIFAR-10 dataset and implement the whole system on the ARTIX-7 100T FPGA with speed performance of 2052.65 GOP/s and area efficiency of 70.15 GOPS/KLUT.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116529347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

FPGA HLS Today: Successes, Challenges, and Opportunities FPGA HLS的今天:成功、挑战和机遇

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-04-21 DOI: 10.1145/3530775

J. Cong, Jason Lau, Gai Liu, S. Neuendorffer, P. Pan, K. Vissers, Zhiru Zhang

引用次数: 30

Tensor Slices: FPGA Building Blocks For The Deep Learning Era

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-04-13 DOI: 10.1145/3529650

Aman Arora, Moinak Ghosh, Samidh Mehta, Vaughn Betz, L. John

{"title":"Tensor Slices: FPGA Building Blocks For The Deep Learning Era","authors":"Aman Arora, Moinak Ghosh, Samidh Mehta, Vaughn Betz, L. John","doi":"10.1145/3529650","DOIUrl":"https://doi.org/10.1145/3529650","url":null,"abstract":"FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed including adding specialized artificial intelligence (AI) processing engines, adding support for smaller precision math like 8-bit fixed point and IEEE half-precision (fp16) in DSP slices, adding shadow multipliers in logic blocks, etc. In this paper, we describe replacing a portion of the FPGA’s programmable logic area with Tensor Slices. These slices have a systolic array of processing elements at their heart that support multiple tensor operations, multiple dynamically-selectable precisions and can be dynamically fractured into individual multipliers and MACs (multiply-and-accumulate). These slices have a local crossbar at the inputs that helps with easing the routing pressure caused by a large block on the FPGA. Adding these DL-specific coarse-grained hard blocks to FPGAs increases their compute density and makes them even better hardware accelerators for DL applications, while still keeping the vast majority of the real estate on the FPGA programmable at fine-grain.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126250669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Stratix 10 NX Architecture Stratix 10nx Architecture

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-03-14 DOI: 10.1145/3520197

M. Langhammer, E. Nurvitadhi, Sergey Gribok, B. Pasca

{"title":"Stratix 10 NX Architecture","authors":"M. Langhammer, E. Nurvitadhi, Sergey Gribok, B. Pasca","doi":"10.1145/3520197","DOIUrl":"https://doi.org/10.1145/3520197","url":null,"abstract":"The advent of AI has driven the exploration of high-density low-precision arithmetic on FPGAs. This has resulted in new methods in mapping both arithmetic functions as well as dataflows onto the fabric, as well as some changes to the embedded DSP Blocks. Technologies outside of the FPGA realm have also evolved, such as the addition of tensor structures for GPUs, as well as the introduction of numerous AI ASSPs, all of which have a higher claimed performance and efficiency than current FPGAs. In this article, we will introduce the Stratix 10 NX device, which is a variant of FPGA specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft-logic fabric, a new type of DSP Block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the common matrix-matrix or vector-matrix multiplications in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared exponent support to support block FP16 and block FP12 numerics. All additions/accumulations can be done in INT32 or IEEE-754 single precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We will also describe methods by which the smaller precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal processing requirements. In the AI market, the FPGA must compete directly with other types of devices, rather than occupy a unique niche. Deterministic system performance is as important as the performance of individual FPGA elements, such as logic, memory, and DSP. We will show that the feed forward datapath structures that are needed to support the typical AI matrix-vector and matrix-matrix multiplication operations can consistently close timing at over 500 MHz on a mid-speed grade device, even if all of the Tensor Blocks on the device are used. We will also show a full-chip NPU processor implementation that out performs GPUs at the same process node for a variety of AI inferencing workloads, even though it has a lower operating frequency of 365 MHz. In terms of overall compute throughput, Stratix 10 NX is specified at 143 INT8/FP16 TOPs/FLOPs or 286 INT4/FP12 TOPS/FLOPs. Depending on the configuration, power efficiency is in the range of 1–4 TOPs or TFLOPs/W.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131227861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

HopliteML: Evolving Application Customized FPGA NoCs with Adaptable Routers and Regulators HopliteML:发展应用定制FPGA noc与自适应路由器和调节器

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-02-14 DOI: 10.1145/3507699

G. Malik, Ian Elmor Lang, R. Pellizzoni, Nachiket Kapre

{"title":"HopliteML: Evolving Application Customized FPGA NoCs with Adaptable Routers and Regulators","authors":"G. Malik, Ian Elmor Lang, R. Pellizzoni, Nachiket Kapre","doi":"10.1145/3507699","DOIUrl":"https://doi.org/10.1145/3507699","url":null,"abstract":"We can overcome the pessimism in worst-case routing latency analysis of timing-predictable Network-on-Chip (NoC) workloads by single-digit factors through the use of a hybrid field-programmable gate array (FPGA)–optimized NoC and workload-adapted regulation. Timing-predictable FPGA-optimized NoCs such as HopliteBuf integrate stall-free FIFOs that are sized using offline static analysis of a user-supplied flow pattern and rates. For certain bursty traffic and flow configurations, static analysis delivers very large, sometimes infeasible, FIFO size bounds and large worst-case latency bounds. Alternatively, backpressure-based NoCs such as HopliteBP can operate with lower latencies for certain bursty flows. However, they suffer from severe pessimism in the analysis due to the effect of pipelining of packets and interleaving of flows at switch ports. As we show in this article, a hybrid FPGA NoC that seamlessly composes both design styles on a per-switch basis delivers the best of both worlds, with improved feasibility (bounded operation) and tighter latency bounds. We select the NoC switch configuration through a novel evolutionary algorithm based on Maximum Likelihood Estimation (MLE). For synthetic (RANDOM, LOCAL) and real-world (SpMV, Graph) workloads, we demonstrate ≈2–3× improvements in feasibility and ≈1–6.8× in worst-case latency while requiring an LUT cost only ≈1–1.5× larger than the cheapest HopliteBuf solution. We also deploy and verify our NoC (PL) and MLE framework (PS) on a Pynq-Z1 to adapt and reconfigure NoC switches dynamically. We can further improve a workload’s routability by learning to surgically tune regulation rates for each traffic trace to maximize available routing bandwidth. We capture critical dependency between traces by modelling the regulation space as a multivariate Gaussian distribution and learn the distribution’s parameters using Covariance Matrix Adaptation Evolution Strategy (CMA-ES). We also propose nested learning, which learns switch configurations and regulation rates in tandem. Compared with stand-alone switch learning, this symbiotic nested learning helps achieve ≈ 1.5× lower cost constrained latency, ≈ 3.1× faster individual rates, and ≈ 1.4× faster mean rates. We also evaluate improvements to vanilla NoCs’ routing using only stand-alone rate learning (no switch learning), with ≈ 1.6× lower latency across synthetic and real-world benchmarks.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114749089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Inducing Non-uniform FPGA Aging Using Configuration-based Short Circuits 基于组态的短路诱导FPGA非均匀老化

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-02-11 DOI: 10.1145/3517042

Hayden Cook, J. Arscott, B. George, Tanner Gaskin, Jeffrey B. Goeders, B. Hutchings

{"title":"Inducing Non-uniform FPGA Aging Using Configuration-based Short Circuits","authors":"Hayden Cook, J. Arscott, B. George, Tanner Gaskin, Jeffrey B. Goeders, B. Hutchings","doi":"10.1145/3517042","DOIUrl":"https://doi.org/10.1145/3517042","url":null,"abstract":"This work demonstrates a novel method of accelerating FPGA aging by configuring FPGAs to implement thousands of short circuits, resulting in high on-chip currents and temperatures. Patterns of ring oscillators are placed across the chip and are used to characterize the operating frequency of the FPGA fabric. Over the course of several months of running the short circuits on two-thirds of the reconfigurable fabric, with daily characterization of the FPGA 6 performance, we demonstrate a decrease in FPGA frequency of 8.5%. We demonstrate that this aging is induced in a non-uniform manner. The maximum slowdown outside of the shorted regions is 2.1%, or about a fourth of the maximum slowdown that is experienced inside the shorted region. In addition, we demonstrate that the slowdown is linear after the first two weeks of the experiment and is unaffected by a recovery period. Additional experiments involving short circuits are also performed to demonstrate the results of our initial experiments are repeatable. These experiments also use a more fine-grained characterization method that provides further insight into the non-uniformed nature of the aging caused by short circuits.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"211 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122453368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking 通过微基准测试为软件程序员揭开现代fpga的软和硬化存储系统的神秘面纱

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2022-02-09 DOI: 10.1145/3517131

Alec Lu, Zhenman Fang, Lesley Shannon

{"title":"Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking","authors":"Alec Lu, Zhenman Fang, Lesley Shannon","doi":"10.1145/3517131","DOIUrl":"https://doi.org/10.1145/3517131","url":null,"abstract":"Both modern datacenter and embedded Field Programmable Gate Arrays (FPGAs) provide great opportunities for high-performance and high-energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for software programmers, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this article is to figure out how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including (1) the clock frequency of the accelerator design, (2) the number of concurrent memory access ports, (3) the data width of each port, (4) the maximum burst access length for each port, and (5) the size of consecutive data accesses. Then, we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and embedded FPGA (Xilinx ZCU104) when changing those affecting factors, and we provide insights into efficient memory access in HLS-based accelerator designs. Comparing between the typically used soft and hardened memory systems, respectively, found on datacenter and embedded FPGAs, we further summarize their unique features and discuss the effective approaches to leverage these systems. To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about ( 3.5times ) and ( 8.5times ) speedups for the KNN and SpMV accelerators. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth, and achieve about ( 5.6times ) and ( 3.4times ) speedups over the 24-core CPU implementations.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126423597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1