{"title":"Lightweight Programmable DSP Block Overlay for Streaming Neural Network Acceleration","authors":"Lenos Ioannou, Suhaib A. Fahmy","doi":"10.1109/ICFPT47387.2019.00066","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00066","url":null,"abstract":"Implementations of hardware accelerators for neural networks are increasingly popular on FPGAs, due to flexibility, achievable performance and efficiency gains resulting from network optimisations. The long compilation time required by the backend toolflow, however, makes rapid deployment and prototyping of such accelerators on FPGAs more difficult. Moreover, achieving high frequency of operation requires significant low-level design effort. We present a neural network overlay for FPGAs that exploits DSP blocks, operating at near their theoretical maximum frequency, while minimizing resource utilization. The proposed architecture is flexible, enabling rapid runtime configuration of network parameters according to the desired network topology. It is tailored for lightweight edge implementations requiring acceleration, rather than the highest throughput achieved by more complex architectures in the datacenter.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130832515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synchronizing On-Chip Software and Hardware Traces for HLS-Accelerated Programs","authors":"M. Ashcraft, Jeffrey B. Goeders","doi":"10.1109/ICFPT47387.2019.00015","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00015","url":null,"abstract":"Complex designs generated from modern high-level synthesis tools allow users to take advantage of heterogeneous systems, splitting the execution of programs between conventional processors, and hardware accelerators. While modern HLS tools continue to improve in efficiency and capability, debugging these designs has received relatively minor attention. Fortunately, recent academic work has provided the first means to debug these designs using hardware and software traces. Though these traces allow the user to analyze the flow of execution on both the software and hardware individually, they provide no means of synchronization to determine how operations on one device affect the other. We address this challenge by introducing a synchronization technique that keeps track of operations on shared objects. We identify objects shared between hardware and software and their memory operations, and use unique identifiers to synchronize the traces around these operations. We explore the added costs of this technique on execution time and hardware and software resources, and ways to reduce it through multiple synchronization schemes. This is demonstrated in an open-source prototype targeting the hybrid flow of the open-source HLS-tool LegUp.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131262092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Memory Access Locality for Vectorized Bit-Serial Matrix Multiplication in Reconfigurable Computing","authors":"Lahiru Rasnayake, Magnus Själander","doi":"10.1109/ICFPT47387.2019.00081","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00081","url":null,"abstract":"Low-precision matrix multiplication has gained significant interest in the research community due to its applicability in the quantized neural network domain. As a result, a multitude of variable precision hardware designs have been proposed since fixed-precision hardware causes under-utilization of the hardware resources due to the low and varying precision in such applications. Bit-serial hardware takes advantage of the frugal nature of bit-serial computations that can operate on only as many bits as necessary. A bit-serial matrix multiplication consists of a summation of weighted binary matrix multiplications. In this work, we study the inherent locality of bit-serial matrix multiplications and propose a locality-aware scheduling algorithm that eliminates redundant data fetches from memory. The proposed schedule improves by up to 76% compared to a schedule that computes each binary matrix multiplication in sequence.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121803485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporters and sponsors","authors":"","doi":"10.1109/icfpt47387.2019.00008","DOIUrl":"https://doi.org/10.1109/icfpt47387.2019.00008","url":null,"abstract":"","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125565888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Winograd-Based Real-Time Super-Resolution System on FPGA","authors":"Bizhao Shi, Zhucheng Tang, Guojie Luo, M. Jiang","doi":"10.1109/ICFPT47387.2019.00083","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00083","url":null,"abstract":"With the rapid development of computer vision theory and visual display devices, High Frame Rate (HFR) and Ultra High Definition (UHD) techniques have received increasing attention from academia and industry. As they put high demands on performance and energy-efficiency, efficient customized hardware is required. In this paper, we propose an FPGA-based super-resolution system that enables real-time UHD upscaling with both high image quality and high frame rates. Our system crops each frame into blocks, measures their total variation values, and dispatches them accordingly to a neural network or an interpolation module for upscaling. We also propose a fast transposed convolution algorithm based on the Winograd algorithm, which reduces the number of multiplications. Experimental results show that the proposed super-resolution system achieves superior performance in both reconstruction performance and efficiency over previous works.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116684399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Polynomial Multiplier Over Commutative Rings on FPGAs: A Case Study on BIKE","authors":"Jingwei Hu, Wen Wang, R. Cheung, Huaxiong Wang","doi":"10.1109/ICFPT47387.2019.00035","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00035","url":null,"abstract":"In this paper, we present two constant-time FPGA-based polynomial multipliers for post-quantum secure key encapsulation mechanisms based on quasi-cyclic codes, which are among round 2 candidates in the NIST PQC standardization process. The pipelined hardware architectures for polynomial multiplications proposed in this work are fully parameterized in terms of the size of the polynomial, and can be further tuned flexibly to achieve a trade-off between time and area depending on individual needs. We also present a case study on the BIKE key generators which use these two polynomial multiplier architectures as building blocks. Compared with the state-of-the-art hardware implementation of BIKE, the design proposed in this work is around 9× faster in terms of run-time while maintaining an over 6× smaller time-area product.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116491130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms","authors":"D. Nguyen, V. Dang, K. Gaj","doi":"10.1109/ICFPT47387.2019.00070","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00070","url":null,"abstract":"Due to an emerging threat of quantum computing, one of the major challenges facing the cryptographic community is a timely transition from traditional public-key cryptosystems, such as RSA and Elliptic Curve Cryptography, to a new class of algorithms, collectively referred to as Post-Quantum Cryptography (PQC). Several promising candidates for a new PQC standard can have their software and hardware implementations accelerated using the Number Theoretic Transform (NTT). In this paper, we present an improved hardware architecture for NTT, with a hardware-friendly modular reduction, and demonstrate that this architecture can be efficiently implemented in hardware using High-Level Synthesis (HLS). The novel feature of the proposed architecture is an original memory write-back scheme, which assists in preparing coefficients for performing later NTT stages, saving memory storage used for precomputed constants. Our design is the most efficient for the case when log2(N) is even. The latency of our proposed architecture is approximately equal to (N log2(N) + 3N)/4 clock cycles. As a proof of concept, we implemented the NTT operation for several parameter sets used in the PQC algorithms NewHope, FALCON, qTESLA, and CRYSTALS-DILITHIUM.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126542008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous Driving Developed with an FPGA Design","authors":"E. Jones, Keegan Pepper, Ai Li, Shiyue Li, Yuteng Zhang, D. Bailey","doi":"10.1109/ICFPT47387.2019.00085","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00085","url":null,"abstract":"For this project, the task is developing algorithms to program an FPGA-controlled vehicle for the FPT'19 design competition. This competition is to encourage the development of level 5 self-driving cars. To achieve level 5 self-driving cars, the use of image processing for object detection, lane detection, traffic light detection and pedestrian detection will be required. With this detection, the vehicle can be guided safely around the track provided. This paper summarises the algorithms developed to achieve autonomous driving techniques, following the regulations of the competition.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126692331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partitioning FPGA-Optimized Systolic Arrays for Fun and Profit","authors":"Long Chung Chan, G. Malik, Nachiket Kapre","doi":"10.1109/ICFPT47387.2019.00025","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00025","url":null,"abstract":"We can improve the inference throughput of deep convolutional networks mapped to FPGA-optimized systolic arrays, at the expense of latency, with array partitioning and layer pipelining. Modern convolutional networks have a growing number of layers, such as the 58 separable layer GoogleNetv1, with varying compute, storage, and data movement requirements. At the same time, modern high-end FPGAs, such as the Xilinx UltraScale+ VU37P, can accommodate high-performance, 650 MHz, layouts of large 1920x9 systolic arrays. These can stay underutilized if the network layer requirements do not match the array size. We formulate an optimization problem, for improving array utilization, and boosting inference throughput, that determines how to partition the systolic array on the FPGA chip, and how to slice the network layers across the array partitions in a pipelined fashion. We adopt a two phase approach where (1) we identify layer assignment for each partition using an Evolutionary Strategy, and (2) we adopt a greedy-but-optimal approach for resource allocation to select the systolic array dimensions of each partition. When compared to state-of-the-art systolic architectures, we show throughput improvements in the range 1.3-1.5x and latency improvements in the range 0.5-1.8x against Multi-CLP and Xilinx SuperTile.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116034860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}