ACM Transactions on Reconfigurable Technology and Systems最新文献

筛选
英文 中文
End-to-end codesign of Hessian-aware quantized neural networks for FPGAs 面向 FPGA 的 Hessian 感知量化神经网络的端到端编码设计
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-05-11 DOI: 10.1145/3662000
Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte
{"title":"End-to-end codesign of Hessian-aware quantized neural networks for FPGAs","authors":"Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte","doi":"10.1145/3662000","DOIUrl":"https://doi.org/10.1145/3662000","url":null,"abstract":"<p>We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA firmware. This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow that can be deployed for real-time machine-learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on FPGA hardware within the strict area and latency requirements. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"44 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140940542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration DyRecMul:使用动态重配置的 FPGA 快速低成本近似乘法器
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-05-01 DOI: 10.1145/3663480
Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois
{"title":"DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration","authors":"Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois","doi":"10.1145/3663480","DOIUrl":"https://doi.org/10.1145/3663480","url":null,"abstract":"<p>Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluations of the multiplier proposed in INT8 precision. The paper also facilitates the generalization of the proposed approximate multiplier idea to other datatypes, providing analysis and estimations for hardware cost and accuracy as a function of multiplier parameters. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulation configurations, respectively when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying less than 0.33%. The source code of this work is available on GitHub.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"31 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs Dynamic-ACTS - 面向 HBM FPGA 的动态图形分析加速器
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-30 DOI: 10.1145/3662002
Oluwole Jaiyeoba, Kevin Skadron
{"title":"Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs","authors":"Oluwole Jaiyeoba, Kevin Skadron","doi":"10.1145/3662002","DOIUrl":"https://doi.org/10.1145/3662002","url":null,"abstract":"<p>Graph processing frameworks suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. A prior work, ACTS [24], accelerates graph processing with FPGAs and High Bandwidth Memory (HBM). ACTS achieves locality by partitioning vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This work introduces Dynamic-ACTS which builds on ideas in ACTS to support dynamic graphs. The key innovation is to use a hash table to find the edges to be updated. Compared to Gunrock, a GPU graph engine, Dynamic-ACTS achieves a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X. Compared to GraphLily, an FPGA-HBM graph engine, Dynamic-ACTS achieves a geometric speedup of 3.6X, with a maximum speedup of 16.5X. Our results also showed a geometric mean power reduction of 50% and a mean reduction of energy-delay product of 88% over Gunrock. Compared to GraSU, an FPGA graph updating engine, Dynamic-ACTS achieves an average speedup of 15X.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"12 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
NC-Library: Expanding SystemC Capabilities for Nested reConfigurable Hardware Modelling NC 库:为嵌套式可重构硬件建模扩展 SystemC 功能
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-27 DOI: 10.1145/3662001
Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer
{"title":"NC-Library: Expanding SystemC Capabilities for Nested reConfigurable Hardware Modelling","authors":"Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer","doi":"10.1145/3662001","DOIUrl":"https://doi.org/10.1145/3662001","url":null,"abstract":"<p>As runtime reconfiguration is used in an increasing number of hardware architectures, new simulation and modeling tools are needed to support the developer during the design phases. In this article, a language extension for SystemC is presented, together with a design methodology for the description and simulation of dynamically reconfigurable hardware at different levels of abstraction. The library presented offers a high degree of flexibility in the description of reconfiguration features and their management, while allowing runtime reconfiguration simulation, removal, and replacement of custom modules as well as third-party components throughout the architecture development process. In addition, our approach supports the emerging concept of nested reconfiguration and split regions with a minimal simulation overhead of a maximum of three delta cycles for signal and transaction forwarding, and four delta cycles for the reconfiguration process.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"157 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140798806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration PQA:探索产品量化在 DNN 硬件加速中的潜力
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-18 DOI: 10.1145/3656643
Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah
{"title":"PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration","authors":"Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah","doi":"10.1145/3656643","DOIUrl":"https://doi.org/10.1145/3656643","url":null,"abstract":"<p>Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups to pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency–accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1 ×, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. When comparing to recent PQ solutions, we outperform prior work by 4 × in terms of performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2–6-bit precision on three compact DNNs, we were able to maintain DNN accuracy eliminating the need for DSPs.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Toward FPGA Intellectual Property (IP) Encryption from Netlist to Bitstream 实现从网表到比特流的 FPGA 知识产权 (IP) 加密
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-12 DOI: 10.1145/3656644
Daniel Hutchings, Adam Taylor, Jeffrey Goeders
{"title":"Toward FPGA Intellectual Property (IP) Encryption from Netlist to Bitstream","authors":"Daniel Hutchings, Adam Taylor, Jeffrey Goeders","doi":"10.1145/3656644","DOIUrl":"https://doi.org/10.1145/3656644","url":null,"abstract":"<p>Current IP encryption methods offered by FPGA vendors use an approach where the IP is decrypted during the CAD flow, and remains unencrypted in the bitstream. Given the ease of accessing modern bitstream-to-netlist tools, encrypted IP is vulnerable to inspection and theft from the IP user. While the entire bitstream can be encrypted, this is done by the user, and is not a mechanism to protect confidentiality of 3rd party IP. </p><p>In this work we present a design methodology, along with a proof-of-concept tool, that demonstrates how IP can remain partially encrypted through the CAD flow and into the bitstream. We show how this approach can support multiple encryption keys from different vendors, and can be deployed using existing CAD tools and FPGA families. Our results document the benefits and costs of using such an approach to provide much greater protection for 3rd party IP.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"247 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration HierCGRA:利用分层建模和自动设计空间探索实现大规模 CGRA 的新型框架
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-08 DOI: 10.1145/3656176
Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang
{"title":"HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration","authors":"Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang","doi":"10.1145/3656176","DOIUrl":"https://doi.org/10.1145/3656176","url":null,"abstract":"<p>Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and interconnections between them. Nevertheless, most CGRAs suffer from the ineffectiveness of supporting flexible architecture design and solving large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With the graph homomorphism (GH) mapping algorithm, HierCGRA achieves a faster mapping speed and higher PE utilization rate compared with the existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by the heterogeneity of PEs and interconnections. With these features, HierCGRA enables the agile development for large-scale CGRA and accelerates the process of finding a better CGRA architecture.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"34 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA R-锁:高能效、灵活、可编程的 CGRA
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-08 DOI: 10.1145/3656642
Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal
{"title":"R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA","authors":"Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal","doi":"10.1145/3656642","DOIUrl":"https://doi.org/10.1145/3656642","url":null,"abstract":"<p>Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domain require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks; a ULP CGRA with HW/SW co-design tool-flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD-processing, while maintaining an extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45 × higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"191 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference 了解基于 FPGA 空间加速的大型语言模型推理的潜力
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-04 DOI: 10.1145/3656177
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
{"title":"Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference","authors":"Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang","doi":"10.1145/3656177","DOIUrl":"https://doi.org/10.1145/3656177","url":null,"abstract":"<p>Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. </p><p>This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. </p><p>To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4 × speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2 × speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9 × speedup and a 5.7 × improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"36 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP DANSEN:利用 NDP 在本地计算存储上加速数据库
IF 2.3 4区 计算机科学
ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-04-04 DOI: 10.1145/3655625
Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch
{"title":"DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP","authors":"Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch","doi":"10.1145/3655625","DOIUrl":"https://doi.org/10.1145/3655625","url":null,"abstract":"<p>This paper introduces <sans-serif>DANSEN</sans-serif>, the hardware accelerator component for neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as a Near-Data Processing (NDP)-operation. The proposed system enables Database Management Systems (DBMS) to offload NDP-operations to the storage while maintaining control over data through a <i>native storage interface</i>. <sans-serif>DANSEN</sans-serif> provides an NDP-engine that enables DBMS to perform both low-level database tasks, such as performing database administration, as well as high-level tasks like executing SQL, <i>on</i> the smart storage device while observing the DBMS concurrency control. Furthermore, <sans-serif>DANSEN</sans-serif> enables the incorporation of custom accelerators as an NDP-operation, e.g., to perform hardware-accelerated ML inference directly on the stored data. We built the <sans-serif>DANSEN</sans-serif> storage prototype and interface on an Ultrascale+HBM FPGA and fully integrated it with PostgreSQL 12. Experimental results demonstrate that the proposed NDP approach outperforms software-only PostgreSQL using a fast off-the-shelf NVMe drive, and significantly improves the end-to-end execution time of an aggregation operation (similar to Q6 from CH-benCHmark, 150 million records) by ≈ 10.6 ×. The versatility of the proposed approach is also validated by integrating a compute-intensive data analytics application with multi-row results, outperforming PostgreSQL by ≈ 1.5 ×.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"101 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信