{"title":"TD3lite: FPGA Acceleration of Reinforcement Learning with Structural and Representation Optimizations","authors":"Chan-Wei Hu, Jiangkun Hu, S. Khatri","doi":"10.1109/FPL57034.2022.00023","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00023","url":null,"abstract":"Reinforcement learning (RL) is an effective and increasingly popular machine learning approach for optimization and decision-making. However, modern reinforcement learning techniques, such as deep Q-learning, often require neural network inference and training, and therefore are computationally expensive. For example, Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art RL technique, uses as many as 6 neural networks. In this work, we study the FPGA-based acceleration of TD3. To address the resource and computational overhead due to inference and training of the multiple neural networks of TD3, we propose TD3lite, an integrated approach consisting of a network sharing technique combined with bitwidth-optimized block floating-point arithmetic. TD3lite is evaluated on several robotic benchmarks with continuous state and action spaces. With only 5.7% learning performance degradation, TD3lite achieves 21× and 8× speedups compared to CPU and GPU implementations, respectively. Its energy efficiency is 26× that of the GPU implementation. 
Moreover, it utilizes ~25-40% fewer FPGA resources compared to a conventional single-precision floating-point representation of TD3.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130503999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ERMES: Efficient Racetrack Memory Emulation System based on FPGA","authors":"F. Spagnolo, Salim Ullah, P. Corsonello, Akash Kumar","doi":"10.1109/FPL57034.2022.00059","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00059","url":null,"abstract":"With the scaling of CMOS technology almost over, non-volatile memories based on emerging technologies are gaining considerable popularity. Particularly, spintronic-based Racetrack memories (RTMs) exhibit unprecedented storage capacity, as well as reduced energy per operation and high write endurance, which make them promising candidates to revolutionize the architecture of memory sub-systems. However, since RTM exploits shifting of magnetic domains to align the required data with the access port, its read/write latency is not constant. Due to this behaviour, several application-specific performance optimizations may be introduced in the memory architecture, the data placement, or both. To this end, specific tools able to emulate the timing characteristics of RTMs are highly desired. Unfortunately, existing software-based simulators suffer from poor flexibility and long run-times. To address such limitations, this paper presents a new emulation system for RTMs based on heterogeneous FPGA-CPU Systems-on-Chips (SoCs). Thanks to its high flexibility, the proposed emulator can be easily configured to evaluate different memory architectures. In addition, the CPU can be used to stimulate the RTM architecture under test with appropriate benchmarks, thus providing a fast self-contained evaluation environment. 
As a case study, ERMES has been implemented on the Xilinx Zynq UltraScale+ XCZU9EG SoC to evaluate the performance of several memory configurations when running benchmark applications from the MiBench suite, achieving a speed-up of more than 146× over software-based simulators.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128980146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeLiBA: An Open-Source Hardware/Software Framework for the Development of Linux Block I/O Accelerators","authors":"Babar Khan, Carsten Heinz, A. Koch","doi":"10.1109/FPL57034.2022.00038","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00038","url":null,"abstract":"With the trend towards ever larger “big data” applications, many of the gains achievable by using specialized compute accelerators become diminished due to the growing I/O overheads. While there have been a number of research efforts into computational storage and FPGA implementations of the NVMe interface, to our knowledge there have been only very limited efforts to move larger parts of the Linux block I/O stack into FPGA-based hardware accelerators. Our hardware/software framework DeLiBA aims to address this deficiency by allowing high-productivity development of software components of the I/O stack in user instead of kernel space, and leverages a proven FPGA SoC framework to quickly compose and deploy the actual FPGA-based I/O accelerators. While the current version of DeLiBA is focused on enabling more productive research instead of on raw performance, even in its current form it achieves 10% higher throughput and up to 2.3x the I/Os per second for a proof-of-concept Ceph accelerator realized using the system. 
These initial results show the large potential of performing further research in this acceleration domain.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129503828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Breaking an FPGA-Integrated NIST SP 800-193 Compliant TRNG Hard-IP Core with On-Chip Voltage-Based Fault Attacks","authors":"Dennis R. E. Gnad, Jiaqi Hu, M. Tahoori","doi":"10.1109/FPL57034.2022.00066","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00066","url":null,"abstract":"Practical cryptographic systems rely on a true random number generator (TRNG), which is a necessary component in any hardware Root-of-Trust (RoT). Hardware trust anchors are also integrated into larger chips, for instance as hard-IP cores in FPGAs, where the remaining FPGA fabric is freely programmable. To provide security guarantees, proper operation of the TRNG is critical. Consequently, adversaries are motivated to tamper with the ability of TRNGs to produce unpredictable random numbers. In this paper, we show that an FPGA on-chip attack can reduce the true randomness of a TRNG integrated as a hard-IP module in the FPGA. This module is considered to be an immutable security module, compliant with the NIST SP 800-193 Platform Firmware Resilience Guidelines (PFR), a well-known guideline for system resilience, and it is also certified by the Cryptographic Algorithm Validation Program (CAVP). By performing an on-chip voltage drop-based fault attack with user-programmable FPGA logic, the random numbers produced by the IP core fail the NIST SP 800-22 and BSI AIS31 tests, meaning they are no longer truly random. 
This paper thus shows that new attack vectors can break even verified IP cores: on-chip attacks are usually not considered in the threat model, yet they can still affect highly integrated systems.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130093780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FRA-FPGA: Fast Reconfigurable Automata Processing on FPGAs","authors":"Peng Zhang, Shijun Zhang, Shang Li, Jin Zhang, Shaoxun Liu, Youjun Bu","doi":"10.1109/FPL57034.2022.00055","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00055","url":null,"abstract":"Accelerating regular expression (regex) matching, or equivalently finite automata processing, using FPGAs is widely adopted by many demanding regex-based applications to improve throughput and power efficiency. However, offloading a large regex rule set entirely into an FPGA is expensive, if not unaffordable, due to the limited on-chip resources. In this paper, we propose FRA-FPGA (Fast Reconfigurable Automata on FPGAs), a homogeneous NFA architecture on FPGAs which can be reconfigured within 1μs. Meanwhile, the reconfiguration time of FRA-FPGA is independent of the number of regex rules it accommodates. Because FRA-FPGA can be reloaded quickly, it is feasible to offload the small subset of activated regex rules into FRA-FPGA dynamically, as opposed to compiling the whole regex rule set into FPGA beforehand. We implemented FRA-FPGA on the Xilinx U200 card to accelerate Hyperscan. 
Our experimental results show that FRA-FPGA improves Hyperscan's throughput by about 15× (stream mode) and 33× (block mode), while consuming only 4.23% of the logic resources and 16.64% of the memory resources of the FPGA (VU9P).","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134356280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Performance FPGA Accelerator for CUR Decomposition","authors":"M. Abdelgawad, R. Cheung, Hong Yan","doi":"10.1109/FPL57034.2022.00052","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00052","url":null,"abstract":"Matrix factorization decomposes a matrix into a product of smaller matrices. It is widely used in machine learning algorithms. There are many matrix decomposition algorithms, and each has various applications. CUR matrix decomposition is a widely-used factorization tool that has been employed for dimension reduction and pattern recognition in many scientific and engineering applications, such as image processing, text mining, and wireless communications. In this paper, we propose an efficient FPGA-based floating-point accelerator using high-level synthesis (HLS) for the CUR decomposition algorithm. Our experimental results demonstrate the higher efficiency of our hardware design compared to optimized CPU-based software solutions. The speedup of our FPGA-based architecture over the optimized software implementation ranges from 2.37 to 16.82 times for different dimensions of the input data matrix. We evaluated our design using large matrices (1024 x 1024 and 2048 x 2048), and the experimental results demonstrate the efficiency of our design in terms of utilized resources and latency. 
Finally, we compared our design with other matrix decomposition algorithms such as SVD and QR decomposition; the experimental results demonstrate that CUR is more efficient than SVD and QR decomposition in terms of latency and required resources.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"268 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115108382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comprehensive Timing Model for Accurate Frequency Tuning in Dataflow Circuits","authors":"Carmine Rizzi, Andrea Guerrieri, P. Ienne, Lana Josipović","doi":"10.1109/FPL57034.2022.00063","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00063","url":null,"abstract":"The ability of dataflow circuits to implement dynamic scheduling promises to overcome the conservatism of static scheduling techniques that high-level synthesis tools typically rely on. Yet, the same distributed control mechanism that allows dataflow circuits to achieve high-throughput pipelines when static scheduling cannot also causes long critical paths and frequency degradation. This effect reduces the overall performance benefits of dataflow circuits and makes them an undesirable solution in broad classes of applications. In this work, we provide an in-depth study of the timing of dataflow circuits. We develop a mathematical model that accurately captures combinational delays among different dataflow constructs and appropriately places buffers to control the critical path. On a set of benchmarks obtained from C code, we show that the circuits optimized by our technique accurately meet the clock period target and result in a critical path reduction of up to 38% compared to prior solutions.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115186372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPL Demo: Runtime Stream Processing with Resource-Elastic Pipelines on FPGAs","authors":"Kaspar Mätas, Kristiyan Manev, Joseph Powell, Dirk Koch","doi":"10.1109/FPL57034.2022.00082","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00082","url":null,"abstract":"FPGAs are efficient at dataflow applications, as demonstrated in various application domains, including machine learning, communication, and image processing. In this demo, we accelerate database management operations transparently to the user by stitching together partially reconfigurable stream processing modules that implement database operators. Our runtime system orchestrates this, building custom pipelines according to runtime conditions. This demo will showcase an acceleration of SQL queries using our dynamic stream processing system running on a ZCU102 FPGA board.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125201622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource Optimal Squarers for FPGAs","authors":"Andreas Böttcher, M. Kumm, F. D. Dinechin","doi":"10.1109/FPL57034.2022.00018","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00018","url":null,"abstract":"Squaring is an essential operation in computer arithmetic that can be considered as a special case of multiplication where several simplifications can be applied to reduce the complexity of the resulting circuit. However, the design of a squarer is not straightforward for modern FPGAs that provide embedded DSP blocks and look-up tables (LUTs). This work proposes a flexible method to design resource-optimal squarers, i.e., squarers that use a minimum number of LUTs for a user-defined number of DSP blocks. The method uses an integer linear programming (ILP) formulation based on a generalization of multiplier tiling. It is shown that the proposed squarer design method significantly improves the LUT utilization for a given number of DSPs over previous methods, while maintaining a similar critical path delay and latency.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129193588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDMA: An Efficient and Flexible Sparse-Dense Matrix-Multiplication Architecture for GNNs","authors":"Yingxue Gao, Lei Gong, Chao Wang, Teng Wang, Xuehai Zhou","doi":"10.1109/FPL57034.2022.00054","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00054","url":null,"abstract":"In recent years, graph neural networks (GNNs) have emerged as a deep learning model. Sparse-Dense Matrix Multiplication (SpMM) is the critical component of GNNs. However, SpMM involves many irregular calculations and random memory accesses, resulting in the inefficiency of general-purpose processors and dedicated accelerators. The highly sparse and uneven distribution of the graph further exacerbates the above problems. In this work, we propose SDMA, an efficient architecture to accelerate SpMM for GNNs. SDMA can collaboratively address the challenges of load imbalance and irregular memory accesses. We first present three hardware-oriented optimization methods: 1) an equal-value partition method that effectively divides the sparse matrix to achieve load balancing between tiles; 2) a vertex-clustering optimization method that exploits more data locality; and 3) an adaptive on-chip dataflow scheduling method that makes full use of computing resources. Then, we integrate the above optimizations into SDMA to achieve a high-performance architecture. Finally, we prototype SDMA on the Xilinx Alveo U50 FPGA. 
The results demonstrate that SDMA achieves 2.19x-3.35x energy efficiency over the GPU implementation and 2.03x DSP efficiency over the FPGA implementation.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"118 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129245827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}