{"title":"A Survey of Processing Systems for Phylogenetics and Population Genetics","authors":"Reinout Corts, Nikolaos S. Alachiotis","doi":"10.1145/3588033","DOIUrl":"https://doi.org/10.1145/3588033","url":null,"abstract":"The COVID-19 pandemic brought Bioinformatics into the spotlight, revealing that several existing methods, algorithms, and tools were not well prepared to handle large amounts of genomic data efficiently. This led to prohibitively long execution times and the need to reduce the extent of analyses to obtain results in a reasonable amount of time. In this survey, we review available high-performance computing and hardware-accelerated systems based on FPGA and GPU technology. Optimized and hardware-accelerated systems can conduct more thorough analyses considerably faster than pure software implementations, allowing to reach important conclusions in a timely manner to drive scientific discoveries. We discuss the reasons that are currently hindering high-performance solutions from being widely deployed in real-world biological analyses and describe a research direction that can pave the way to enable this.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 27"},"PeriodicalIF":2.3,"publicationDate":"2023-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47009013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZyPR: End-to-end Build Tool and Runtime Manager for Partial Reconfiguration of FPGA SoCs at the Edge","authors":"Alex R. Bucknall, Suhaib A. Fahmy","doi":"10.1145/3585521","DOIUrl":"https://doi.org/10.1145/3585521","url":null,"abstract":"Partial reconfiguration (PR) is a key enabler to the design and development of adaptive systems on modern Field Programmable Gate Array (FPGA) Systems-on-Chip (SoCs), allowing hardware to be adapted dynamically at runtime. Vendor-supported PR infrastructure is performance-limited and blocking, drivers entail complex memory management, and software/hardware design requires bespoke knowledge of the underlying hardware. This article presents ZyPR: a complete end-to-end framework that provides high-performance reconfiguration of hardware from within a software abstraction in the Linux userspace, automating the process of building PR applications with support for the Xilinx Zynq and Zynq UltraScale+ architectures, aimed at enabling non-expert application designers to leverage PR for edge applications. We compare ZyPR against traditional vendor tooling for PR management as well as recent open source tools that support PR under Linux. The framework provides a high-performance runtime along with low overhead for its provided abstractions. We introduce improvements to our previous work, increasing the provisioning throughput for PR bitstreams on the Zynq Ultrascale+ by 2× and 5.4× compared to Xilinx’s FPGA Manager.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 33"},"PeriodicalIF":2.3,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43056654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen
{"title":"AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis","authors":"Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen","doi":"10.1145/3572959","DOIUrl":"https://doi.org/10.1145/3572959","url":null,"abstract":"High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify problems that limit the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 30"},"PeriodicalIF":2.3,"publicationDate":"2023-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45485527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to Special Section on FPT’20","authors":"O. Sinnen, Qiang Liu, A. Davoodi","doi":"10.1145/3579850","DOIUrl":"https://doi.org/10.1145/3579850","url":null,"abstract":"Remarn","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 2"},"PeriodicalIF":2.3,"publicationDate":"2023-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44843906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erwei Wang, Marie Auffret, G. Stavrou, P. Cheung, G. Constantinides, M. Abdelfattah, James J. Davis
{"title":"Logic Shrinkage: Learned Connectivity Sparsification for LUT-Based Neural Networks","authors":"Erwei Wang, Marie Auffret, G. Stavrou, P. Cheung, G. Constantinides, M. Abdelfattah, James J. Davis","doi":"10.1145/3583075","DOIUrl":"https://doi.org/10.1145/3583075","url":null,"abstract":"FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this article, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing appropriate K a priori is challenging, and doing so at even high granularity, e.g. per layer, is a time-consuming and error-prone process that leaves FPGAs’ spatial flexibility underexploited. Furthermore, prior works see LUT inputs connected randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54 × and 1.31 ×, respectively, while matching its accuracy. This implementation also reaches 2.71 × the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment of logic shrinkage results in a post-synthesis area reduction of 2.67 × vs LUTNet, allowing for implementation that was previously impossible on today’s largest FPGAs. We validate the benefits of logic shrinkage in the context of real application deployment by implementing a face mask detection DNN using BNN, LUTNet and logic-shrunk layers. Our results show that logic shrinkage results in area gains versus LUTNet (up to 1.20 ×) and equally pruned BNNs (up to 1.08 ×), along with accuracy improvements.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"1 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43758566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VCSN: Virtual Circuit-Switching Network for Flexible and Simple-to-Operate Communication in HPC FPGA Cluster","authors":"Tomohiro Ueno, K. Sano","doi":"10.1145/3579848","DOIUrl":"https://doi.org/10.1145/3579848","url":null,"abstract":"FPGA clusters promise to play a critical role in high-performance computing (HPC) systems in the near future due to their flexibility and high power efficiency. The operation of large-scale general-purpose FPGA clusters on which multiple users run diverse applications requires flexible network topology to be divided and reconfigured. This paper proposes Virtual Circuit-Switching Network (VCSN) that provides an arbitrarily reconfigurable network topology and simple-to-operate network system among FPGA nodes. With virtualization, user logic on FPGAs can communicate with each other as if a circuit-switching network was available. This paper demonstrates that VCSN with 100 Gbps Ethernet achieves highly-efficient point-to-point communication among FPGAs due to its unique and efficient communication protocol. We compare VCSN with a direct connection network (DCN) that connects FPGAs directly. We also show a concrete procedure to realize collective communication on an FPGA cluster with VCSN. We demonstrate that the flexible virtual topology provided by VCSN can accelerate collective communication with simple operations. Furthermore, based on experimental results, we model and estimate communication performance by DCN and VCSN in a large FPGA cluster. The result shows that VCSN has the potential to accelerate gather communication up to about 1.97 times more than DCN.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 32"},"PeriodicalIF":2.3,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46978447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deterministic Approach for Range-enhanced Reconfigurable Packet Classification Engine","authors":"M. Dhayalakumar, S. Mahammad","doi":"10.1145/3586577","DOIUrl":"https://doi.org/10.1145/3586577","url":null,"abstract":"","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"81 1","pages":"29:1-29:26"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64069356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Niklas Schelten, Fritjof Steinert, Justin Knapheide, Anton Schulte, B. Stabernack
{"title":"A High-Throughput, Resource-Efficient Implementation of the RoCEv2 Remote DMA Protocol and its Application","authors":"Niklas Schelten, Fritjof Steinert, Justin Knapheide, Anton Schulte, B. Stabernack","doi":"10.1145/3543176","DOIUrl":"https://doi.org/10.1145/3543176","url":null,"abstract":"The use of application-specific accelerators in data centers has been the state of the art for at least a decade, starting with the availability of General Purpose GPUs achieving higher performance either overall or per watt. In most cases, these accelerators are coupled via PCIe interfaces to the corresponding hosts, which leads to disadvantages in interoperability, scalability and power consumption. As a viable alternative to PCIe-attached FPGA accelerators this paper proposes standalone FPGAs as Network-attached Accelerators (NAAs). To enable reliable communication for decoupled FPGAs we present an RDMA over Converged Ethernet v2 (RoCEv2) communication stack for high-speed and low-latency data transfer integrated into a hardware framework. For NAAs to be used instead of PCIe coupled FPGAs the framework must provide similar throughput and latency with low resource usage. We show that our RoCEv2 stack is capable of achieving 100 Gb/s throughput with latencies of less than 4μs while using about 10% of the available resources on a mid-range FPGA. To evaluate the energy efficiency of our NAA architecture, we built a demonstrator with 8 NAAs for machine learning based image classification. Based on our measurements, network-attached FPGAs are a great alternative to the more energy-demanding PCIe-attached FPGA accelerators.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 23"},"PeriodicalIF":2.3,"publicationDate":"2022-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43810654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, J. Cong
{"title":"FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA","authors":"Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, J. Cong","doi":"10.1145/3570928","DOIUrl":"https://doi.org/10.1145/3570928","url":null,"abstract":"With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for reasons: (1) the different dimensions within same-type layers, (2) the different convolution layers especially transposed and dilated convolutions, and (3) CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX1 representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, with up to 5.98× and 13.42× for transposed and dilated convolutions, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 32"},"PeriodicalIF":2.3,"publicationDate":"2022-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42093076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on FPL 2020","authors":"N. Mentens, Lionel Sousa, P. Trancoso","doi":"10.1145/3536336","DOIUrl":"https://doi.org/10.1145/3536336","url":null,"abstract":"The International Conference on Field Programmable Logic and Applications (FPL) was the first and remains the largest conference in the important area of field-programmable logic and reconfigurable computing. The 30th edition of FPL was scheduled to be from August 31 to September 4, 2020, in the Chalmers Conference Center in Gothenburg, Sweden, but was moved to a virtual format due to the coronavirus disease (COVID-19). From 158 submissions, the program committee selected 24 full papers and 28 short papers to be presented in the conference. The FPL Program coChairs invited the authors of the best papers to submit an extended version of their FPL published work for composing a Special Issue of the ACM Transactions on Reconfigurable Technology and Systems. Six extended articles that went through a completely new review process have been accepted to be published in this Special Issue. These articles bring new results of research efforts in reconfigurable computing, in the areas of placement and connection of nodes and hard-blocks, nearmemory processing and HBM, NoCs, and aging in FPGAs. We acknowledge the support of all reviewers, which are fundamental in the article selection process, also for giving valuable suggestions to the authors. Thanks also go to the authors who submitted articles, and to the ACM TRETS support team. We also thank Professor Deming Chen, Editor-in-Chief of ACM TRETS, for hosting this special issue. The article Exploiting HBM on FPGAs for Data Processing focuses on the potential to exploit High Bandwidth Memory (HBM) for FPGA acceleration of data analytics workloads. The authors investigate different aspects of the computation as well as data partitioning and placement. For the evaluation of the FPGA+HBM setup, the authors integrate into an in-memory database system three relevant workloads: range selection, hash join, and stochastic gradient descent. The results show large performance benefits (6–18×) of the proposed approach when compared to traditional server systems used for the same workloads justifying the use of HBM for FPGA accelerators for these workloads. The article Detailed Placement for Dedicated LUT-level FPGA Interconnect studies the impact of dedicated placement on FPGA architectures with direct connections between the Look-Up Tables (LUTs). The authors propose a novel algorithm that orchestrates different Linear Programs (LPs)","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"15 1","pages":"1 - 2"},"PeriodicalIF":2.3,"publicationDate":"2022-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43627321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}