{"title":"Interactive Debugging at IP Block Interfaces in FPGAs","authors":"M. Merlini, Isamu Poy, P. Chow","doi":"10.1145/3431920.3439305","DOIUrl":"https://doi.org/10.1145/3431920.3439305","url":null,"abstract":"Recent developments have shown FPGAs to be effective for data centre applications, but debugging support in that environment has not evolved correspondingly. This presents an additional barrier to widespread adoption. This work proposes Debug Governors, a new open-source debugger designed for controllability and interactive debugging that can help to locate issues across multiple FPGAs. A Debug Governor can pause, log, drop, and/or inject data into any streaming interface. These operations enable single-stepping, unit testing, and interfacing with software. Hundreds of Debug Governors can fit in a single FPGA and, because they are transparent when inactive, can be left \"dormant'' in production designs. We show how Debug Governors can be used to resolve functional problems on a real FPGA, and how they can be extended to memory-mapped protocols.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128278984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoSA","authors":"Jie Wang, Licheng Guo, J. Cong","doi":"10.1145/3431920.3439292","DOIUrl":"https://doi.org/10.1145/3431920.3439292","url":null,"abstract":"While systolic array architectures have the potential to deliver tremendous performance, it is notoriously challenging to customize an efficient systolic array processor for a target application. Designing systolic arrays requires knowledge for both high-level characteristics of the application and low-level hardware details, thus making it a demanding and inefficient process. To relieve users from the manual iterative trial-and-error process, we present AutoSA, an end-to-end compilation framework for generating systolic arrays on FPGA. AutoSA is based on the polyhedral framework, and further incorporates a set of optimizations on different dimensions to boost performance. An efficient and comprehensive design space exploration is performed to search for high-performance designs. We have demonstrated AutoSA on a wide range of applications, on which AutoSA achieves high performance within a short amount of time. As an example, for matrix multiplication, AutoSA achieves 934 GFLOPs, 3.41 TOPs, and 6.95 TOPs in floating point, 16-bit and 8-bit integer data types on Xilinx Alveo U250.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130842950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top-down Physical Design of Soft Embedded FPGA Fabrics","authors":"P. Mohan, Oguz Atli, Onur O. Kibar, Mohammed Zackriya, Larry Pileggi, K. Mai","doi":"10.1145/3431920.3439297","DOIUrl":"https://doi.org/10.1145/3431920.3439297","url":null,"abstract":"In recent years, IC reverse engineering and IC fabrication supply chain security have grown to become significant economic and security threats for designers, system integrators, and end customers. Many of the existing logic locking and obfuscation techniques have shown to be vulnerable to attack once the attacker has access to the design netlist either through reverse engineering or through an untrusted fabrication facility. We introduce soft embedded FPGA redaction, a hardware obfuscation approach that allows the designer substitute security-critical IP blocks within a design with a synthesizable eFPGA fabric. This method fully conceals the logic and the routing of the critical IP and is compatible with standard ASIC flows for easy integration and process portability. To demonstrate eFPGA redaction, we obfuscate a RISC-V control path and a GPS P-code generator. We also show that the modified netlists are resilient to SAT attacks with moderate VLSI overheads. The secure RISC-V design has 1.89x area and 2.36x delay overhead while the GPS design has 1.39x area and negligible delay overhead when implemented on an industrial 22nm FinFET CMOS process.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129690504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators","authors":"Peipei Zhou, Jiayi Sheng, Cody Hao Yu, Peng Wei, Jie Wang, Di Wu, J. Cong","doi":"10.1145/3431920.3439304","DOIUrl":"https://doi.org/10.1145/3431920.3439304","url":null,"abstract":"FPGAs have been widely deployed in public clouds, e.g., Amazon Web Services (AWS) and Huawei Cloud. However, simply offloading accelerated kernels from CPU hosts to PCIe-based FPGAs does not guarantee out-of-pocket cost savings in a pay-as-you-go public cloud. Taking Genome Analysis Toolkit (GATK) applications as case studies, although the adoption of FPGAs reduces the overall execution time, it introduces 2.56× extra cost, due to insufficient application-level speedup by Amdahl's law. To optimize the out-of-pocket cost while keeping high speedup and throughput, we propose Mocha framework as a distributed runtime system to fully utilize the accelerator resource by accelerator sharing and CPU-FPGA partial task offloading. Evaluation results on Haplotype Caller (HTC) and Mutect2 in GATK show that on AWS, Mocha saves on the application cost by 2.82x for HTC, 1.06x for Mutect2 and on Huawei Cloud by 1.22x, 1.52x respectively than straightforward CPU-FPGA integration solution with less than 5.1% performance overhead.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127664307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Folded Integer Multiplication for FPGAs","authors":"M. Langhammer, B. Pasca","doi":"10.1145/3431920.3439299","DOIUrl":"https://doi.org/10.1145/3431920.3439299","url":null,"abstract":"Encryption - especially the key exchange algorithms such as RSA - is an increasing use-model for FPGAs, driven by the adoption of the FPGA as a SmartNIC in the datacenter. While bulk encryption such as AES maps well to generic FPGA features, the very large multipliers required for RSA are a much more difficult problem. Although FPGAs contain thousands of small integer multipliers in DSP Blocks, aggregating them into very large multipliers is very challenging because of the large amount of soft logic required - especially in the form of long adders, and the high embedded multiplier count. In this paper, we describe a large multiplier architecture that operates in a multi-cycle format and which has a linear area/throughput ratio. We show results for a 2048-bit multiplier that has a latency of 118 cycles, inputs data every 9th cycle and closes timing at 377MHz in an Intel Arria 10 FPGA, and over 400MHz in a Stratix 10. The proposed multiplier uses 1/9 of the DSP resources typically used in a 2048-bit Karatsuba implementation, showing a perfectly linear throughput to DSP-count ratio. Our proposed solution outperforms recently reported results, in either arithmetic complexity - by making use of the Karatsuba techniques, or in scheduling efficiency - embedded DSP resources are fully utilized.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131437732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic Optimization for High-Level Synthesis","authors":"Jianyi Cheng, John Wickerson, G. Constantinides","doi":"10.1145/3431920.3439455","DOIUrl":"https://doi.org/10.1145/3431920.3439455","url":null,"abstract":"High-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description. A key challenge in HLS tools is scheduling, i.e. determining the start time of all the operations in the untimed program. There are three approaches to scheduling: static, dynamic and hybrid. Static scheduling has been well studied, however, statically analysing dynamic hardware behaviours is still challenging due to the unpredictability due to run-time dependencies. Existing approaches either assume the worst-case timing behaviour, which can cause significant performance loss or area overhead, or use simulation, which takes significant time to explore a sufficiently large number of program traces. In this work, we introduce a novel probabilistic model allowing HLS tools to efficiently estimate and optimize the cycle-level timing behaviour of HLS-generated hardware. Our framework offers insights to assist both hardware engineers and HLS tools when estimating and optimizing hardware performance.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131441147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulating and Evaluating a Quaternary Logic FPGA Based on Floating-gate Memories and Voltage Division","authors":"Ayokunle Fadamiro, Pouyan Rezaie, S. Millican, Christopher Harris","doi":"10.1145/3431920.3439471","DOIUrl":"https://doi.org/10.1145/3431920.3439471","url":null,"abstract":"Technology scaling cannot meet consumer demands, especially for binary circuits. Previous studies proposed addressing this with multi-valued logic (MVL) architectures, but these architectures use non-standard fabrication techniques and optimistic performance analysis. This study presents a new quaternary FPGA (QFPGA) architecture based on floating-gate memories that standard CMOS fabrication can fabricate: programming floating-gates implement a voltage divider, and these divided voltages represent one of four distinct logic values. When simulated with open-source FinFET SPICE models, the proposed architecture obtains competitive delay and power performance compared to equivalent binary and QFPGA architectures from literature. Results show the proposed QFPGA basic logic element (BLE) requires half the area and dissipates a third of the power density compared to QFPGA architectures from literature. When projecting BLE performance onto benchmark circuits, implementing circuits requires up to 55% less area and one-third the power, and the proposed architecture can operate at clock speeds up to three times faster than binary equivalents. Future studies will investigate accurate modeling of interconnects to better account for their performance impacts and will explore efficient architectures for programming MVL memories when they're used in FPGAs.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131731301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SWIFT: Small-World-based Structural Pruning to Accelerate DNN Inference on FPGA","authors":"Yufei Ma, Gokul Krishnan, Yu Cao, Le Ye, Ru Huang","doi":"10.1145/3431920.3439465","DOIUrl":"https://doi.org/10.1145/3431920.3439465","url":null,"abstract":"State-of-the-art DNN pruning approaches achieved high sparsity. However, these methods usually do not consider the intrinsic graph property of DNNs, leading to an irregular pruned network. Consequently, hardware accelerators cannot directly benefit from such pruning, suffering additional cost on indexing, control and data paths. Inspired by the observation that the brain and real-world networks follow a Small-World model, we propose a graph-based progressive structural pruning technique, SWIFT, that integrates local clusters and global sparsity in DNNs to benefit the dataflow and workload balance of the accelerators. In particular, we propose an output stationary FPGA architecture to accelerate DNN inference and integrate it with the structural sparsity by SWIFT, so that the communication and computation of clustered zero weights are eliminated. In addition, a full mesh data router is designed to adaptively direct inputs into corresponding processing elements (PEs) for different layer configurations and skipping zero operations. The proposed SWIFT is evaluated with multiple DNNs on different datasets. It achieves sparsity ratio up to 76% for CIFAR-10, 83% for CIFAR-100, 76% for the SVHN datasets. Moreover, our proposed SWIFT FPGA accelerator achieves up to 4.4× improvement in throughput for different dense networks with a marginal hardware overhead.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"31 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129954669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NPE: An FPGA-based Overlay Processor for Natural Language Processing","authors":"H. Khan, Asma Khan, Zainab F. Khan, L. Huang, Kun Wang, Lei He","doi":"10.1145/3431920.3439477","DOIUrl":"https://doi.org/10.1145/3431920.3439477","url":null,"abstract":"In recent years, transformer-based models have shown state-of-the-art results for Natural Language Processing (NLP). In particular, the introduction of the BERT language model brought with it breakthroughs in tasks such as question answering and natural language inference, advancing applications that allow humans to interact naturally with embedded devices. FPGA-based overlay processors have been shown as effective solutions for edge image and video processing applications, which mostly rely on low precision linear matrix operations. In contrast, transformer-based NLP techniques employ a variety of higher precision nonlinear operations with significantly higher frequency. We present NPE, an FPGA-based overlay processor that can efficiently execute a variety of NLP models. NPE offers software-like programmability to the end user and, unlike FPGA designs that implement specialized accelerators for each nonlinear function, can be upgraded for future NLP models without requiring reconfiguration. NPE can meet real-time conversational AI latency targets for the BERT language model with 4x lower power than CPUs and 6x lower power than GPUs. We also show NPE uses 3x fewer FPGA resources relative to comparable BERT network-specific accelerators in the literature. NPE provides a cost-effective and power-efficient FPGA-based solution for Natural Language Processing at the edge.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132589038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking","authors":"Alec Lu, Zhenman Fang, Weihua Liu, Lesley Shannon","doi":"10.1145/3431920.3439284","DOIUrl":"https://doi.org/10.1145/3431920.3439284","url":null,"abstract":"With the public availability of FPGAs from major cloud service providers like AWS, Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to figure out how to efficiently access the memory system of modern datacenter FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the number of concurrent memory access ports, 2) the data width of each port, 3) the maximum burst access length for each port, and 4) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo U200 and U280 FPGA memory systems when changing those affecting factors, and provide insights into efficient memory access in HLS-based accelerator designs. To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"231 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134089154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}