{"title":"TD3lite: FPGA Acceleration of Reinforcement Learning with Structural and Representation Optimizations","authors":"Chan-Wei Hu, Jiangkun Hu, S. Khatri","doi":"10.1109/FPL57034.2022.00023","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00023","url":null,"abstract":"Reinforcement learning (RL) is an effective and increasingly popular machine learning approach for optimization and decision-making. However, modern reinforcement learning techniques, such as deep Q-learning, often require neural network inference and training, and therefore are computationally expensive. For example, Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art RL technique, uses as many as 6 neural networks. In this work, we study the FPGA-based acceleration of TD3. To address the resource and computational overhead due to inference and training of the multiple neural networks of TD3, we propose TD3lite, an integrated approach consisting of a network sharing technique combined with bitwidth-optimized block floating-point arithmetic. TD3lite is evaluated on several robotic benchmarks with continuous state and action spaces. With only 5.7% learning performance degradation, TD3lite achieves 21× and 8× speedups compared to CPU and GPU implementations, respectively. Its energy efficiency is 26× that of the GPU implementation. 
Moreover, it utilizes ~25-40% fewer FPGA resources compared to a conventional single-precision floating-point representation of TD3.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130503999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ERMES: Efficient Racetrack Memory Emulation System based on FPGA","authors":"F. Spagnolo, Salim Ullah, P. Corsonello, Akash Kumar","doi":"10.1109/FPL57034.2022.00059","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00059","url":null,"abstract":"With the scaling of CMOS technology almost over, non-volatile memories based on emerging technologies are gaining considerable popularity. Particularly, spintronic-based Racetrack memories (RTMs) exhibit unprecedented storage capacity, as well as reduced energy per operation and high write endurance, which make them promising candidates to revolutionize the architecture of memory sub-systems. However, since RTM exploits shifting of magnetic domains to align the required data with the access port, its read/write latency is not constant. Due to this behaviour, several application-specific performance optimizations may be introduced in the memory architecture, the data placement, or both. To this end, specific tools able to emulate the timing characteristics of RTMs are highly desired. Unfortunately, existing software-based simulators suffer from poor flexibility and long run-times. To address such limitations, this paper presents a new emulation system for RTMs based on heterogeneous FPGA-CPU Systems-on-Chips (SoCs). Thanks to its high flexibility, the proposed emulator can be easily configured to evaluate different memory architectures. In addition, the CPU can be used to stimulate the RTM architecture under test with appropriate benchmarks, thus providing a fast self-contained evaluation environment. 
As a case study, ERMES has been implemented on the Xilinx Zynq UltraScale+ XCZU9EG SoC to evaluate the performance of several memory configurations when running benchmark applications from the MiBench suite, achieving a speed-up of more than 146× over software-based simulators.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128980146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeLiBA: An Open-Source Hardware/Software Framework for the Development of Linux Block I/O Accelerators","authors":"Babar Khan, Carsten Heinz, A. Koch","doi":"10.1109/FPL57034.2022.00038","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00038","url":null,"abstract":"With the trend towards ever larger “big data” applications, many of the gains achievable by using specialized compute accelerators become diminished due to the growing I/O overheads. While there have been a number of research efforts into computational storage and FPGA implementations of the NVMe interface, to our knowledge there have been only very limited efforts to move larger parts of the Linux block I/O stack into FPGA-based hardware accelerators. Our hardware/software framework DeLiBA aims to address this deficiency by allowing high-productivity development of software components of the I/O stack in user instead of kernel space, and leverages a proven FPGA SoC framework to quickly compose and deploy the actual FPGA-based I/O accelerators. While the current version of DeLiBA is focused on enabling more productive research instead of on raw performance, even in its current form it achieves 10% higher throughput and up to 2.3x the I/Os per second for a proof-of-concept Ceph accelerator realized using the system. 
These initial results show the large potential of performing further research in this acceleration domain.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129503828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Breaking an FPGA-Integrated NIST SP 800-193 Compliant TRNG Hard-IP Core with On-Chip Voltage-Based Fault Attacks","authors":"Dennis R. E. Gnad, Jiaqi Hu, M. Tahoori","doi":"10.1109/FPL57034.2022.00066","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00066","url":null,"abstract":"Practical cryptographic systems rely on a true random number generator (TRNG), which is a necessary component in any hardware Root-of-Trust (RoT). Hardware trust anchors are also integrated into larger chips, for instance as hard-IP cores in FPGAs, where the remaining FPGA fabric is freely programmable. To provide security guarantees, proper operation of the TRNG is critical. Consequently, adversaries are motivated to tamper with the ability of TRNGs to produce unpredictable random numbers. In this paper, we show that an FPGA on-chip attack can reduce the true randomness of a TRNG integrated as a hard-IP module in the FPGA. This module is considered to be an immutable security module, compliant with the NIST SP 800-193 Platform Firmware Resilience Guidelines (PFR), a well-known guideline for system resilience, and it is also certified by the Cryptographic Algorithm Validation Program (CAVP). By performing an on-chip voltage drop-based fault attack with user-programmable FPGA logic, the random numbers produced by the IP core fail the NIST SP 800-22 and BSI AIS31 tests, meaning they are no longer truly random. 
This paper thus shows that new attack vectors can break even verified IP cores: on-chip attacks are usually not considered in the threat model, yet they can still affect highly integrated systems.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130093780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FRA-FPGA: Fast Reconfigurable Automata Processing on FPGAs","authors":"Peng Zhang, Shijun Zhang, Shang Li, Jin Zhang, Shaoxun Liu, Youjun Bu","doi":"10.1109/FPL57034.2022.00055","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00055","url":null,"abstract":"Accelerating regular expression (regex) matching, or equivalently finite automata processing, using FPGAs is widely adopted by many demanding regex-based applications to improve throughput and power efficiency. However, offloading a large regex rule set entirely into an FPGA is expensive, if not unaffordable, due to the limited on-chip resources. In this paper, we propose FRA-FPGA (Fast Reconfigurable Automata on FPGAs), a homogeneous NFA architecture on FPGAs which can be reconfigured within 1μs. Meanwhile, the reconfiguration time of FRA-FPGA is independent of the number of regex rules it accommodates. Because FRA-FPGA can be reloaded quickly, it is feasible to offload the small subset of activated regex rules into FRA-FPGA dynamically, as opposed to compiling the whole regex rule set into FPGA beforehand. We implemented FRA-FPGA on the Xilinx U200 card to accelerate Hyperscan. 
Our experimental results show that FRA-FPGA improves Hyperscan's throughput by about 15× (stream mode) and 33× (block mode), while consuming only 4.23% of the logic resources and 16.64% of the memory resources of the FPGA (VU9P).","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134356280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Performance FPGA Accelerator for CUR Decomposition","authors":"M. Abdelgawad, R. Cheung, Hong Yan","doi":"10.1109/FPL57034.2022.00052","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00052","url":null,"abstract":"Matrix factorization decomposes a matrix into a product of smaller matrices. It is widely used in machine learning algorithms. There are many matrix decomposition algorithms, and each has various applications. CUR matrix decomposition is a widely-used factorization tool that has been employed for dimension reduction and pattern recognition in many scientific and engineering applications, such as image processing, text mining, and wireless communications. In this paper, we propose an efficient FPGA-based floating-point accelerator using high-level synthesis (HLS) for the CUR decomposition algorithm. Our experimental results demonstrate the higher efficiency of our hardware design compared to optimized CPU-based software solutions. The speedup of our FPGA-based architecture over the optimized software implementation ranges from 2.37 to 16.82 times for different dimensions of the input data matrix. We evaluated our design using large matrices (1024 x 1024 and 2048 x 2048), and the experimental results demonstrate the efficiency of our design in terms of utilized resources and latency. 
Finally, we compared our design with other matrix decomposition algorithms such as SVD and QR decomposition; the experimental results demonstrate that CUR is more efficient than SVD and QR decomposition in terms of latency and required resources.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"268 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115108382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comprehensive Timing Model for Accurate Frequency Tuning in Dataflow Circuits","authors":"Carmine Rizzi, Andrea Guerrieri, P. Ienne, Lana Josipović","doi":"10.1109/FPL57034.2022.00063","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00063","url":null,"abstract":"The ability of dataflow circuits to implement dynamic scheduling promises to overcome the conservatism of static scheduling techniques that high-level synthesis tools typically rely on. Yet, the same distributed control mechanism that allows dataflow circuits to achieve high-throughput pipelines when static scheduling cannot also causes long critical paths and frequency degradation. This effect reduces the overall performance benefits of dataflow circuits and makes them an undesirable solution in broad classes of applications. In this work, we provide an in-depth study of the timing of dataflow circuits. We develop a mathematical model that accurately captures combinational delays among different dataflow constructs and appropriately places buffers to control the critical path. On a set of benchmarks obtained from C code, we show that the circuits optimized by our technique accurately meet the clock period target and result in a critical path reduction of up to 38% compared to prior solutions.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115186372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPL Demo: Runtime Stream Processing with Resource-Elastic Pipelines on FPGAs","authors":"Kaspar Mätas, Kristiyan Manev, Joseph Powell, Dirk Koch","doi":"10.1109/FPL57034.2022.00082","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00082","url":null,"abstract":"FPGAs are efficient at dataflow applications, as demonstrated in various application domains, including machine learning, communication, and image processing. In this demo, we accelerate database management operations transparently to the user by stitching together partially reconfigurable stream processing modules that implement database operators. Our runtime system orchestrates this, building custom pipelines according to runtime conditions. This demo will showcase an acceleration of SQL queries using our dynamic stream processing system running on a ZCU102 FPGA board.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125201622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource Optimal Squarers for FPGAs","authors":"Andreas Böttcher, M. Kumm, F. D. Dinechin","doi":"10.1109/FPL57034.2022.00018","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00018","url":null,"abstract":"Squaring is an essential operation in computer arithmetic that can be considered as a special case of multiplication where several simplifications can be applied to reduce the complexity of the resulting circuit. However, the design of a squarer is not straightforward for modern FPGAs that provide embedded DSP blocks and look-up tables (LUTs). This work proposes a flexible method to design resource-optimal squarers, i.e., squarers that use a minimum number of LUTs for a user-defined number of DSP blocks. The method uses an integer linear programming (ILP) formulation based on a generalization of multiplier tiling. It is shown that the proposed squarer design method significantly improves the LUT utilization for a given number of DSPs over previous methods, while maintaining a similar critical path delay and latency.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129193588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDMA: An Efficient and Flexible Sparse-Dense Matrix-Multiplication Architecture for GNNs","authors":"Yingxue Gao, Lei Gong, Chao Wang, Teng Wang, Xuehai Zhou","doi":"10.1109/FPL57034.2022.00054","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00054","url":null,"abstract":"In recent years, graph neural networks (GNNs) have emerged as a deep learning model. Sparse-Dense Matrix Multiplication (SpMM) is the critical component of GNNs. However, SpMM involves many irregular calculations and random memory accesses, resulting in the inefficiency of general-purpose processors and dedicated accelerators. The highly sparse and uneven distribution of the graph further exacerbates the above problems. In this work, we propose SDMA, an efficient architecture to accelerate SpMM for GNNs. SDMA can collaboratively address the challenges of load imbalance and irregular memory accesses. We first present three hardware-oriented optimization methods: 1) an equal-value partition method that effectively divides the sparse matrix to achieve load balancing between tiles; 2) a vertex-clustering optimization method that exploits more data locality; and 3) an adaptive on-chip dataflow scheduling method that makes full use of computing resources. Then, we integrate the above optimizations into SDMA to achieve a high-performance architecture. Finally, we prototype SDMA on the Xilinx Alveo U50 FPGA. 
The results demonstrate that SDMA achieves 2.19x-3.35x energy efficiency over the GPU implementation and 2.03x DSP efficiency over the FPGA implementation.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"118 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129245827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}