Title: The Future of FPGA Acceleration in Datacenters and the Cloud
Authors: C. Bobda, Joel Mandebi Mbongue, P. Chow, M. Ewais, Naif Tarafdar, Juan Camilo Vega, Ken Eguro, Dirk Koch, Suranga Handagala, M. Leeser, M. Herbordt, Hafsah Shahzad, Peter Hofstee, Burkhard Ringlein, Jakub Szefer, A. Sanaullah, R. Tessier
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2022-02-04. DOI: 10.1145/3506713
Abstract: In this article, we survey existing academic and commercial efforts to provide Field-Programmable Gate Array (FPGA) acceleration in datacenters and the cloud. The goal is a critical review of existing systems and a discussion of their evolution, from single workstations with PCI-attached FPGAs in the early days of reconfigurable computing to the integration of FPGA farms in large-scale computing infrastructures. From the lessons learned, we discuss the future of FPGAs in datacenters and the cloud and assess the challenges likely to be encountered along the way. The article explores current architectures and discusses scalability and the abstractions supported by operating systems, middleware, and virtualization. Hardware and software security becomes critical when infrastructure is shared among tenants with disparate backgrounds. We review the vulnerabilities of current systems and possible attack scenarios, and discuss mitigation strategies, some of which impact FPGA architecture and technology. The viability of these architectures for popular applications is reviewed, with a particular focus on deep learning and scientific computing. This work draws from workshop discussions, panel sessions with the participation of experts in the reconfigurable computing field, and private discussions among these experts. These interactions have harmonized the terminology, taxonomy, and important topics covered in this manuscript.
Title: Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS
Authors: Florian Dewald, Johanna Rohde, C. Hochberger, H. Mantel
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2022-02-04. DOI: 10.1145/3501801
Abstract: High-level synthesis (HLS) can be used to create hardware accelerators for compute-intensive software parts such as loop structures. Usually, this process requires a significant amount of user interaction to steer kernel selection and optimizations, which can be tedious and time-consuming. In this article, we present an approach that fully autonomously finds independent loop iterations and reductions to create parallelized accelerators. We combine static analysis with information available only at runtime to maximize the parallelism exploited by the created accelerators. For loops where we see potential for parallelism, we create fully parallelized kernel implementations. If static information does not suffice to deduce independence, we assume independence at compile time and verify this assumption with statically generated checks that are evaluated dynamically at runtime before the optimized kernel is used. Evaluating our approach, we generate speedups for five out of seven benchmarks. With four loop iterations running in parallel, we achieve ideal speedups of up to 4× and an average speedup of 2.27×, both in comparison to an unoptimized accelerator.
{"title":"BurstZ+: Eliminating The Communication Bottleneck of Scientific Computing Accelerators via Accelerated Compression","authors":"Gongjin Sun, Seongyoung Kang, S. Jun","doi":"10.1145/3476831","DOIUrl":"https://doi.org/10.1145/3476831","url":null,"abstract":"We present BurstZ+, an accelerator platform that eliminates the communication bottleneck between PCIe-attached scientific computing accelerators and their host servers, via hardware-optimized compression. While accelerators such as GPUs and FPGAs provide enormous computing capabilities, their effectiveness quickly deteriorates once data is larger than its on-board memory capacity, and performance becomes limited by the communication bandwidth of moving data between the host memory and accelerator. Compression has not been very useful in solving this issue due to performance and efficiency issues of compressing floating point numbers, which scientific data often consists of. BurstZ+ is an FPGA-based prototype accelerator platform which addresses the bandwidth issue via a class of novel hardware-optimized floating point compression algorithm called ZFP-V. We demonstrate that BurstZ+ can completely remove the host-side communication bottleneck for accelerators, using multiple stencil kernels with a wide range of operational intensities. Evaluated against hand-optimized implementations of kernel accelerators of the same architecture, our single-pipeline BurstZ+ prototype outperforms an accelerator without compression by almost 4×, and even an accelerator with enough memory for the entire dataset by over 2×. Furthermore, the projected performance of BurstZ+ on a future, faster FPGA scales to almost 7× that of the same accelerator without compression, whose performance is still limited by the PCIe bandwidth.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"10 Sup2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132360538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Introduction to Special Issue on FPGAs in Data Centers
Authors: Ken Eguro, S. Neuendorffer, V. Prasanna, Hongbo Rong
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2022-01-31. DOI: 10.1145/3493607
Abstract: Hardware accelerators have recently been used to augment the compute power of data centers and improve the performance of many applications, particularly latency-sensitive ones. In fact, several commercial vendors offer FPGAs in their cloud platforms. This special issue of ACM Transactions on Reconfigurable Technology and Systems presents advanced research on using FPGAs in data centers. The articles cover several topics, including the impact of terrestrial radiation; memory system optimization using FPGAs; use and management of network-accessible FPGAs; virtualization and runtime resource management of FPGAs; novel applications of FPGAs in data centers; FPGA IP cores for data center acceleration; latency and performance tradeoffs in using FPGAs for acceleration; and communication optimization using FPGAs. In response to the call for papers, 21 papers were received. After a thorough review following the ACM manuscript review guidelines, 13 papers were accepted. The papers are grouped into two issues; this issue includes 10 of the accepted papers.

The article "Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning" by Petrica et al. presents an automatic partitioning technique to maximize the performance and scalability of FPGA-based pipeline dataflow DNN inference accelerators on computing infrastructures consisting of multi-die, network-connected FPGAs. The article "xDNN: Inference for Deep Convolutional Neural Networks" by D'Alberto et al. presents an end-to-end system for deep-learning inference based on a family of specialized hardware processors for convolutional neural networks synthesized on FPGAs. The article "Hardware Acceleration of High-Performance Computational Flow Dynamics Using High-Bandwidth Memory-Enabled Field-Programmable Gate Arrays" by Nane et al. studies the potential of FPGAs for computational flow dynamics in the context of rapid advances in reconfigurable hardware, such as the increase in on-chip memory size, the growing number of logic cells, and the integration of high-bandwidth memories on board. The article "BurstZ+: Eliminating the Communication Bottleneck of Scientific Computing Accelerators via Accelerated Compression" by Jun et al. presents an accelerator platform that eliminates the communication bottleneck between PCIe-attached scientific computing accelerators and their host servers via hardware-optimized compression. The article "Request, Coalesce, Serve, and Forget: Miss-Optimized Memory Systems for Bandwidth-Bound Cache-Unfriendly Applications on FPGAs" by Asiatici et al. presents an efficient on-chip memory system for applications such as graph analytics that minimizes the number of pipeline stalls. The article "NASCENT2: Generic Near-Storage Sort Accelerator for Data Analytics on SmartSSD" by Salamat et al. presents an efficient algorithm for sorting of database tables via par…
{"title":"NASCENT2: Generic Near-Storage Sort Accelerator for Data Analytics on SmartSSD","authors":"Sahand Salamat, Hui Zhang, Y. Ki, Tajana Rosing","doi":"10.1145/3472769","DOIUrl":"https://doi.org/10.1145/3472769","url":null,"abstract":"As the size of data generated every day grows dramatically, the computational bottleneck of computer systems has shifted toward storage devices. The interface between the storage and the computational platforms has become the main limitation due to its limited bandwidth, which does not scale when the number of storage devices increases. Interconnect networks do not provide simultaneous access to all storage devices and thus limit the performance of the system when executing independent operations on different storage devices. Offloading the computations to the storage devices eliminates the burden of data transfer from the interconnects. Near-storage computing offloads a portion of computations to the storage devices to accelerate big data applications. In this article, we propose a generic near-storage sort accelerator for data analytics, NASCENT2, which utilizes Samsung SmartSSD, an NVMe flash drive with an on-board FPGA chip that processes data in situ. NASCENT2 consists of dictionary decoder, sort, and shuffle FPGA-based accelerators to support sorting database tables based on a key column with any arbitrary data type. It exploits data partitioning applied by data processing management systems, such as SparkSQL, to breakdown the sort operations on colossal tables to multiple sort operations on smaller tables. NASCENT2 generic sort provides 2 × speedup and 15.2 × energy efficiency improvement as compared to the CPU baseline. It moreover considers the specifications of the SmartSSD (e.g., the FPGA resources, interconnect network, and solid-state drive bandwidth) to increase the scalability of computer systems as the number of storage devices increases. With 12 SmartSSDs, NASCENT2 is 9.9× (137.2 ×) faster and 7.3 × (119.2 ×) more energy efficient in sorting the largest tables of TPCC and TPCH benchmarks than the FPGA (CPU) baseline.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130247040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: xDNN: Inference for Deep Convolutional Neural Networks
Authors: P. D'Alberto, Victor Wu, A. Ng, Rahul Nimaiyar, Elliott Delaye, Ashish Sirasao
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2022-01-11. DOI: 10.1145/3473334
Abstract: We present xDNN, an end-to-end system for deep-learning inference based on a family of specialized hardware processors for Convolutional Neural Networks (CNNs) synthesized on Field-Programmable Gate Arrays (FPGAs). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and a parametric function of the number of multiply-accumulate units, the on-chip memory hierarchy, and the numerical precision. The design can be scaled down to produce a processor for embedded devices, replicated to produce more cores for larger devices, or resized to optimize efficiency. On a Xilinx Virtex UltraScale+ VU13P FPGA, we achieve 800 MHz, which is close to the maximum Digital Signal Processing frequency, and above 80% efficiency of on-chip compute resources. On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (i.e., from 224×224 to 2048×1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and TensorFlow), optimizes them, generates code, and provides performance estimates. The compiler combines quantization information from the native environment with optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools that partition a CNN into subgraphs to divide the work between CPU cores and FPGAs. Notice that the software will not change when or if the FPGA design becomes an ASIC, making our work vertical and not just a proof-of-concept FPGA project. We show experimental results for accuracy, latency, and power for several networks. In summary, we achieve up to 4× higher throughput and 3× better power efficiency than GPUs, and up to 20× higher throughput than the latest CPUs. To our knowledge, our solutions are faster than any previous FPGA-based solution and comparable to other top-of-the-shelf solutions.
Title: BlastFunction: A Full-stack Framework Bringing FPGA Hardware Acceleration to Cloud-native Applications
Authors: Andrea Damiani, Giorgia Fiscaletti, M. Bacis, Rolando Brondolin, M. Santambrogio
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2022-01-11. DOI: 10.1145/3472958
Abstract: "Cloud-native" is the umbrella adjective describing the standard approach for developing applications that exploit cloud infrastructures' scalability and elasticity at their best. As application complexity and user bases grow, designing for performance becomes a first-class engineering concern. In answer to these needs, heterogeneous computing platforms have gained widespread attention as powerful tools to continue meeting SLAs for compute-intensive cloud-native workloads. We propose BlastFunction, an FPGA-as-a-Service full-stack framework that eases the adoption of FPGAs for cloud-native workloads and integrates with the fundamental cloud service models. At the IaaS level, BlastFunction time-shares FPGA-based accelerators to provide multi-tenant access to accelerated resources without any code rewriting. At the PaaS level, BlastFunction accelerates functionalities leveraging the serverless model and scales functions proactively, depending on the workload's performance. To further lower the adoption barrier, an accelerator registry hosts accelerated functions ready to be used within cloud-native applications, bringing the simplicity of a SaaS-like approach to developers. After an extensive experimental campaign against state-of-the-art cloud scenarios, we show how BlastFunction achieves higher utilization and throughput than native execution, with minimal latency and overhead differences. Moreover, the proposed scaling scheme outperforms the main serverless autoscaling algorithms in both workload performance and the number of scaling operations.
Title: A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud
Authors: Shulin Zeng, Guohao Dai, Hanbo Sun, Jun Liu, Shiyao Li, Guangjun Ge, Kai Zhong, Kaiyuan Guo, Yu Wang, Huazhong Yang
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2021-12-28. DOI: 10.1145/3480170
Abstract: INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed on a single task, and the multi-tenancy required by INFaaS has not yet been explored. As demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is not cost-effective, while merely sharing these single-task-optimized accelerators through time-division multiplexing leads to poor isolation and high performance loss for INFaaS. On the other hand, current cloud-based DNN accelerators have excessive compilation overhead, especially when scaling out to multi-FPGA systems for multi-tenant sharing, leading to unacceptable compilation costs for both offline deployment and online reconfiguration. They are therefore far from providing efficient and flexible FPGA virtualization for public and private cloud scenarios. To solve these problems, we propose a unified virtualization framework for general-purpose deep neural networks in the cloud, enabling multi-tenant sharing of both Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) accelerators on a single FPGA. Isolation is enabled by a two-level instruction dispatch module and a multi-core hardware resource pool. These designs provide isolated and runtime-programmable hardware resources, which in turn yield performance isolation for multi-tenant sharing. To overcome the heavy re-compilation overhead, we propose a tiling-based instruction frame package design and a two-stage static-dynamic compilation: only the lightweight runtime information is re-compiled, with ∼1 ms overhead, thus guaranteeing performance in the private cloud. Finally, extensive experimental results show that the proposed virtualized solutions achieve up to 3.12× and 6.18× higher throughput in the private cloud compared with the static CNN and RNN baseline designs, respectively.
Title: A Real-Time Deep Learning OFDM Receiver
Authors: Stefan Brennsteiner, T. Arslan, J. Thompson, A. McCormick
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2021-12-28. DOI: 10.1145/3494049
Abstract: Machine learning in the physical layer of communication systems holds the potential to improve performance and simplify design methodology. Many algorithms have been proposed; however, their model complexity is often infeasible for real-time deployment, and the real-time processing capability of these systems has not yet been proven. In this work, we propose a novel, less complex, fully connected neural network to perform channel estimation and signal detection in an orthogonal frequency-division multiplexing (OFDM) system. The memory requirement, which is often the bottleneck for fully connected neural networks, is reduced by approximately 27× by applying known compression techniques in a three-step training process. Extensive experiments were performed to prune and quantize the weights of the neural network detector, and Huffman encoding was applied to the weights to further reduce the memory requirement. Based on this approach, we propose the first field-programmable gate array based, real-time-capable neural network accelerator specifically designed to accelerate the OFDM detector workload. The accelerator is synthesized for a Xilinx RFSoC field-programmable gate array, uses small-batch processing to increase throughput, efficiently supports branching neural networks, and implements superscalar Huffman decoders.
Title: AIgean: An Open Framework for Deploying Machine Learning on Heterogeneous Clusters
Authors: Naif Tarafdar, G. Di Guglielmo, P. Harris, J. Krupa, V. Loncar, D. Rankin, Nhan Tran, Zhenbin Wu, Q. Shen, P. Chow
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), published 2021-12-28. DOI: 10.1145/3482854
Abstract: AIgean, pronounced like the sea, is an open framework to build and deploy machine learning (ML) algorithms on a heterogeneous cluster of devices (CPUs and FPGAs). We leverage two open-source projects: Galapagos, for multi-FPGA deployment, and hls4ml, for generating ML kernels synthesizable with Vivado HLS. AIgean provides a full end-to-end multi-FPGA/CPU implementation of a neural network: the user supplies a high-level neural network description, and our tool flow is responsible for synthesizing the individual layers, partitioning the layers across different nodes, and providing the bridging and routing required for these layers to communicate. If the user is an expert in a particular domain and would like to tinker with the implementation details of the neural network, we define a flexible implementation stack for ML that includes the layers of Algorithms, Cluster Deployment & Communication, and Hardware. This allows the user to modify specific layers of abstraction without having to worry about components outside their area of expertise, highlighting the modularity of AIgean. We demonstrate the effectiveness of AIgean with two use cases: an autoencoder and ResNet-50, running across 10 and 12 FPGAs, respectively. AIgean leverages the FPGA's strength in low-latency computing, as our implementations target batch-1 execution.