ACM Transactions on Reconfigurable Technology and Systems最新文献_第10页

A Survey of Processing Systems for Phylogenetics and Population Genetics 系统发育与群体遗传学处理系统综述

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-03-16 DOI: 10.1145/3588033

Reinout Corts, Nikolaos S. Alachiotis

引用次数: 0

ZyPR: End-to-end Build Tool and Runtime Manager for Partial Reconfiguration of FPGA SoCs at the Edge ZyPR:端到端构建工具和运行时管理器，用于FPGA soc的边缘部分重新配置

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-02-27 DOI: 10.1145/3585521

Alex R. Bucknall, Suhaib A. Fahmy

{"title":"ZyPR: End-to-end Build Tool and Runtime Manager for Partial Reconfiguration of FPGA SoCs at the Edge","authors":"Alex R. Bucknall, Suhaib A. Fahmy","doi":"10.1145/3585521","DOIUrl":"https://doi.org/10.1145/3585521","url":null,"abstract":"Partial reconfiguration (PR) is a key enabler to the design and development of adaptive systems on modern Field Programmable Gate Array (FPGA) Systems-on-Chip (SoCs), allowing hardware to be adapted dynamically at runtime. Vendor-supported PR infrastructure is performance-limited and blocking, drivers entail complex memory management, and software/hardware design requires bespoke knowledge of the underlying hardware. This article presents ZyPR: a complete end-to-end framework that provides high-performance reconfiguration of hardware from within a software abstraction in the Linux userspace, automating the process of building PR applications with support for the Xilinx Zynq and Zynq UltraScale+ architectures, aimed at enabling non-expert application designers to leverage PR for edge applications. We compare ZyPR against traditional vendor tooling for PR management as well as recent open source tools that support PR under Linux. The framework provides a high-performance runtime along with low overhead for its provided abstractions. We introduce improvements to our previous work, increasing the provisioning throughput for PR bitstreams on the Zynq Ultrascale+ by 2× and 5.4× compared to Xilinx’s FPGA Manager.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 33"},"PeriodicalIF":2.3,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43056654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis AutoScaleDSE：用于高级综合的可扩展设计空间探索引擎

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-02-15 DOI: 10.1145/3572959

Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen

{"title":"AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis","authors":"Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen","doi":"10.1145/3572959","DOIUrl":"https://doi.org/10.1145/3572959","url":null,"abstract":"High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify problems that limit the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 30"},"PeriodicalIF":2.3,"publicationDate":"2023-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45485527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Introduction to Special Section on FPT’20 FPT ' 20特别部分介绍

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-02-15 DOI: 10.1145/3579850

O. Sinnen, Qiang Liu, A. Davoodi

引用次数: 0

Logic Shrinkage: Learned Connectivity Sparsification for LUT-Based Neural Networks 逻辑收缩：基于LUT的神经网络的学习连通性稀疏

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-02-10 DOI: 10.1145/3583075

Erwei Wang, Marie Auffret, G. Stavrou, P. Cheung, G. Constantinides, M. Abdelfattah, James J. Davis

{"title":"Logic Shrinkage: Learned Connectivity Sparsification for LUT-Based Neural Networks","authors":"Erwei Wang, Marie Auffret, G. Stavrou, P. Cheung, G. Constantinides, M. Abdelfattah, James J. Davis","doi":"10.1145/3583075","DOIUrl":"https://doi.org/10.1145/3583075","url":null,"abstract":"FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this article, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing appropriate K a priori is challenging, and doing so at even high granularity, e.g. per layer, is a time-consuming and error-prone process that leaves FPGAs’ spatial flexibility underexploited. Furthermore, prior works see LUT inputs connected randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54 × and 1.31 ×, respectively, while matching its accuracy. This implementation also reaches 2.71 × the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment of logic shrinkage results in a post-synthesis area reduction of 2.67 × vs LUTNet, allowing for implementation that was previously impossible on today’s largest FPGAs. We validate the benefits of logic shrinkage in the context of real application deployment by implementing a face mask detection DNN using BNN, LUTNet and logic-shrunk layers. Our results show that logic shrinkage results in area gains versus LUTNet (up to 1.20 ×) and equally pruned BNNs (up to 1.08 ×), along with accuracy improvements.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"1 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43758566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

VCSN: Virtual Circuit-Switching Network for Flexible and Simple-to-Operate Communication in HPC FPGA Cluster VCSN: HPC FPGA集群中灵活、简单通信的虚拟电路交换网络

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-01-13 DOI: 10.1145/3579848

Tomohiro Ueno, K. Sano

{"title":"VCSN: Virtual Circuit-Switching Network for Flexible and Simple-to-Operate Communication in HPC FPGA Cluster","authors":"Tomohiro Ueno, K. Sano","doi":"10.1145/3579848","DOIUrl":"https://doi.org/10.1145/3579848","url":null,"abstract":"FPGA clusters promise to play a critical role in high-performance computing (HPC) systems in the near future due to their flexibility and high power efficiency. The operation of large-scale general-purpose FPGA clusters on which multiple users run diverse applications requires flexible network topology to be divided and reconfigured. This paper proposes Virtual Circuit-Switching Network (VCSN) that provides an arbitrarily reconfigurable network topology and simple-to-operate network system among FPGA nodes. With virtualization, user logic on FPGAs can communicate with each other as if a circuit-switching network was available. This paper demonstrates that VCSN with 100 Gbps Ethernet achieves highly-efficient point-to-point communication among FPGAs due to its unique and efficient communication protocol. We compare VCSN with a direct connection network (DCN) that connects FPGAs directly. We also show a concrete procedure to realize collective communication on an FPGA cluster with VCSN. We demonstrate that the flexible virtual topology provided by VCSN can accelerate collective communication with simple operations. Furthermore, based on experimental results, we model and estimate communication performance by DCN and VCSN in a large FPGA cluster. The result shows that VCSN has the potential to accelerate gather communication up to about 1.97 times more than DCN.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 32"},"PeriodicalIF":2.3,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46978447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Deterministic Approach for Range-enhanced Reconfigurable Packet Classification Engine 范围增强可重构包分类引擎的确定性方法

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-01-01 DOI: 10.1145/3586577

M. Dhayalakumar, S. Mahammad

引用次数: 0

A High-Throughput, Resource-Efficient Implementation of the RoCEv2 Remote DMA Protocol and its Application RoCEv2远程DMA协议的高吞吐量、资源高效实现及其应用

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2022-12-22 DOI: 10.1145/3543176

Niklas Schelten, Fritjof Steinert, Justin Knapheide, Anton Schulte, B. Stabernack

{"title":"A High-Throughput, Resource-Efficient Implementation of the RoCEv2 Remote DMA Protocol and its Application","authors":"Niklas Schelten, Fritjof Steinert, Justin Knapheide, Anton Schulte, B. Stabernack","doi":"10.1145/3543176","DOIUrl":"https://doi.org/10.1145/3543176","url":null,"abstract":"The use of application-specific accelerators in data centers has been the state of the art for at least a decade, starting with the availability of General Purpose GPUs achieving higher performance either overall or per watt. In most cases, these accelerators are coupled via PCIe interfaces to the corresponding hosts, which leads to disadvantages in interoperability, scalability and power consumption. As a viable alternative to PCIe-attached FPGA accelerators this paper proposes standalone FPGAs as Network-attached Accelerators (NAAs). To enable reliable communication for decoupled FPGAs we present an RDMA over Converged Ethernet v2 (RoCEv2) communication stack for high-speed and low-latency data transfer integrated into a hardware framework. For NAAs to be used instead of PCIe coupled FPGAs the framework must provide similar throughput and latency with low resource usage. We show that our RoCEv2 stack is capable of achieving 100 Gb/s throughput with latencies of less than 4μs while using about 10% of the available resources on a mid-range FPGA. To evaluate the energy efficiency of our NAA architecture, we built a demonstrator with 8 NAAs for machine learning based image classification. Based on our measurements, network-attached FPGAs are a great alternative to the more energy-demanding PCIe-attached FPGA accelerators.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 23"},"PeriodicalIF":2.3,"publicationDate":"2022-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43810654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA FlexCNN:在FPGA上组成CNN加速器的端到端框架

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2022-12-20 DOI: 10.1145/3570928

Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, J. Cong

{"title":"FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA","authors":"Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, J. Cong","doi":"10.1145/3570928","DOIUrl":"https://doi.org/10.1145/3570928","url":null,"abstract":"With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for reasons: (1) the different dimensions within same-type layers, (2) the different convolution layers especially transposed and dilated convolutions, and (3) CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX1 representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, with up to 5.98× and 13.42× for transposed and dilated convolutions, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 32"},"PeriodicalIF":2.3,"publicationDate":"2022-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42093076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Introduction to the Special Section on FPL 2020 FPL 2020特别部分介绍

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2022-12-14 DOI: 10.1145/3536336

N. Mentens, Lionel Sousa, P. Trancoso

{"title":"Introduction to the Special Section on FPL 2020","authors":"N. Mentens, Lionel Sousa, P. Trancoso","doi":"10.1145/3536336","DOIUrl":"https://doi.org/10.1145/3536336","url":null,"abstract":"The International Conference on Field Programmable Logic and Applications (FPL) was the first and remains the largest conference in the important area of field-programmable logic and reconfigurable computing. The 30th edition of FPL was scheduled to be from August 31 to September 4, 2020, in the Chalmers Conference Center in Gothenburg, Sweden, but was moved to a virtual format due to the coronavirus disease (COVID-19). From 158 submissions, the program committee selected 24 full papers and 28 short papers to be presented in the conference. The FPL Program coChairs invited the authors of the best papers to submit an extended version of their FPL published work for composing a Special Issue of the ACM Transactions on Reconfigurable Technology and Systems. Six extended articles that went through a completely new review process have been accepted to be published in this Special Issue. These articles bring new results of research efforts in reconfigurable computing, in the areas of placement and connection of nodes and hard-blocks, nearmemory processing and HBM, NoCs, and aging in FPGAs. We acknowledge the support of all reviewers, which are fundamental in the article selection process, also for giving valuable suggestions to the authors. Thanks also go to the authors who submitted articles, and to the ACM TRETS support team. We also thank Professor Deming Chen, Editor-in-Chief of ACM TRETS, for hosting this special issue. The article Exploiting HBM on FPGAs for Data Processing focuses on the potential to exploit High Bandwidth Memory (HBM) for FPGA acceleration of data analytics workloads. The authors investigate different aspects of the computation as well as data partitioning and placement. For the evaluation of the FPGA+HBM setup, the authors integrate into an in-memory database system three relevant workloads: range selection, hash join, and stochastic gradient descent. The results show large performance benefits (6–18×) of the proposed approach when compared to traditional server systems used for the same workloads justifying the use of HBM for FPGA accelerators for these workloads. The article Detailed Placement for Dedicated LUT-level FPGA Interconnect studies the impact of dedicated placement on FPGA architectures with direct connections between the Look-Up Tables (LUTs). The authors propose a novel algorithm that orchestrates different Linear Programs (LPs)","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"15 1","pages":"1 - 2"},"PeriodicalIF":2.3,"publicationDate":"2022-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43627321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0