ACM Transactions on Reconfigurable Technology and Systems最新文献

End-to-end codesign of Hessian-aware quantized neural networks for FPGAs 面向 FPGA 的 Hessian 感知量化神经网络的端到端编码设计

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-05-11 DOI: 10.1145/3662000

Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte

{"title":"End-to-end codesign of Hessian-aware quantized neural networks for FPGAs","authors":"Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte","doi":"10.1145/3662000","DOIUrl":"https://doi.org/10.1145/3662000","url":null,"abstract":"We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA firmware. This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow that can be deployed for real-time machine-learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on FPGA hardware within the strict area and latency requirements. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"44 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140940542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration DyRecMul：使用动态重配置的 FPGA 快速低成本近似乘法器

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-05-01 DOI: 10.1145/3663480

Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois

{"title":"DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration","authors":"Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois","doi":"10.1145/3663480","DOIUrl":"https://doi.org/10.1145/3663480","url":null,"abstract":"Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluations of the multiplier proposed in INT8 precision. The paper also facilitates the generalization of the proposed approximate multiplier idea to other datatypes, providing analysis and estimations for hardware cost and accuracy as a function of multiplier parameters. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulation configurations, respectively when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying less than 0.33%. The source code of this work is available on GitHub.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"31 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs Dynamic-ACTS - 面向 HBM FPGA 的动态图形分析加速器

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-30 DOI: 10.1145/3662002

Oluwole Jaiyeoba, Kevin Skadron

引用次数: 0

NC-Library: Expanding SystemC Capabilities for Nested reConfigurable Hardware Modelling NC 库：为嵌套式可重构硬件建模扩展 SystemC 功能

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-27 DOI: 10.1145/3662001

Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer

引用次数: 0

PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration PQA：探索产品量化在 DNN 硬件加速中的潜力

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-18 DOI: 10.1145/3656643

Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah

{"title":"PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration","authors":"Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah","doi":"10.1145/3656643","DOIUrl":"https://doi.org/10.1145/3656643","url":null,"abstract":"Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups to pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency–accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1 ×, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. When comparing to recent PQ solutions, we outperform prior work by 4 × in terms of performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2–6-bit precision on three compact DNNs, we were able to maintain DNN accuracy eliminating the need for DSPs.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Toward FPGA Intellectual Property (IP) Encryption from Netlist to Bitstream 实现从网表到比特流的 FPGA 知识产权 (IP) 加密

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-12 DOI: 10.1145/3656644

Daniel Hutchings, Adam Taylor, Jeffrey Goeders

引用次数: 0

HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration HierCGRA：利用分层建模和自动设计空间探索实现大规模 CGRA 的新型框架

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-08 DOI: 10.1145/3656176

Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang

{"title":"HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration","authors":"Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang","doi":"10.1145/3656176","DOIUrl":"https://doi.org/10.1145/3656176","url":null,"abstract":"Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and interconnections between them. Nevertheless, most CGRAs suffer from the ineffectiveness of supporting flexible architecture design and solving large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With the graph homomorphism (GH) mapping algorithm, HierCGRA achieves a faster mapping speed and higher PE utilization rate compared with the existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by the heterogeneity of PEs and interconnections. With these features, HierCGRA enables the agile development for large-scale CGRA and accelerates the process of finding a better CGRA architecture.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"34 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA R-锁：高能效、灵活、可编程的 CGRA

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-08 DOI: 10.1145/3656642

Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal

{"title":"R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA","authors":"Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal","doi":"10.1145/3656642","DOIUrl":"https://doi.org/10.1145/3656642","url":null,"abstract":"Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domain require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks; a ULP CGRA with HW/SW co-design tool-flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD-processing, while maintaining an extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45 × higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"191 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference 了解基于 FPGA 空间加速的大型语言模型推理的潜力

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-04 DOI: 10.1145/3656177

Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

{"title":"Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference","authors":"Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang","doi":"10.1145/3656177","DOIUrl":"https://doi.org/10.1145/3656177","url":null,"abstract":"Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4 × speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2 × speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9 × speedup and a 5.7 × improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"36 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP DANSEN：利用 NDP 在本地计算存储上加速数据库

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-04 DOI: 10.1145/3655625

Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch

{"title":"DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP","authors":"Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch","doi":"10.1145/3655625","DOIUrl":"https://doi.org/10.1145/3655625","url":null,"abstract":"This paper introduces <sans-serif>DANSEN</sans-serif>, the hardware accelerator component for neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as a Near-Data Processing (NDP)-operation. The proposed system enables Database Management Systems (DBMS) to offload NDP-operations to the storage while maintaining control over data through a native storage interface. <sans-serif>DANSEN</sans-serif> provides an NDP-engine that enables DBMS to perform both low-level database tasks, such as performing database administration, as well as high-level tasks like executing SQL, on the smart storage device while observing the DBMS concurrency control. Furthermore, <sans-serif>DANSEN</sans-serif> enables the incorporation of custom accelerators as an NDP-operation, e.g., to perform hardware-accelerated ML inference directly on the stored data. We built the <sans-serif>DANSEN</sans-serif> storage prototype and interface on an Ultrascale+HBM FPGA and fully integrated it with PostgreSQL 12. Experimental results demonstrate that the proposed NDP approach outperforms software-only PostgreSQL using a fast off-the-shelf NVMe drive, and significantly improves the end-to-end execution time of an aggregation operation (similar to Q6 from CH-benCHmark, 150 million records) by ≈ 10.6 ×. The versatility of the proposed approach is also validated by integrating a compute-intensive data analytics application with multi-row results, outperforming PostgreSQL by ≈ 1.5 ×.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"101 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0