A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2021-12-28 DOI:10.1145/3480170

Shulin Zeng, Guohao Dai, Hanbo Sun, Jun Liu, Shiyao Li, Guangjun Ge, Kai Zhong, Kaiyuan Guo, Yu Wang, Huazhong Yang

{"title":"A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud","authors":"Shulin Zeng, Guohao Dai, Hanbo Sun, Jun Liu, Shiyao Li, Guangjun Ge, Kai Zhong, Kaiyuan Guo, Yu Wang, Huazhong Yang","doi":"10.1145/3480170","DOIUrl":null,"url":null,"abstract":"INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed of a single task, while the multi-tenancy of INFaaS has not been explored yet. As the demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is not cost-effective, while merely sharing these single-task optimized DNN accelerators in a time-division multiplexing way could lead to poor isolation and high-performance loss for INFaaS. On the other hand, current cloud-based DNN accelerators have excessive compilation overhead, especially when scaling out to multi-FPGA systems for multi-tenant sharing, leading to unacceptable compilation costs for both offline deployment and online reconfiguration. Therefore, it is far from providing efficient and flexible FPGA virtualization for public and private cloud scenarios. Aiming to solve these problems, we propose a unified virtualization framework for general-purpose deep neural networks in the cloud, enabling multi-tenant sharing for both the Convolution Neural Network (CNN), and the Recurrent Neural Network (RNN) accelerators on a single FPGA. The isolation is enabled by introducing a two-level instruction dispatch module and a multi-core based hardware resources pool. Such designs provide isolated and runtime-programmable hardware resources, which further leads to performance isolation for multi-tenant sharing. On the other hand, to overcome the heavy re-compilation overheads, a tiling-based instruction frame package design and a two-stage static-dynamic compilation, are proposed. Only the lightweight runtime information is re-compiled with ∼1 ms overhead, thus guaranteeing the private cloud’s performance. Finally, the extensive experimental results show that the proposed virtualized solutions achieve up to 3.12× and 6.18× higher throughput in the private cloud compared with the static CNN and RNN baseline designs, respectively.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3480170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

Abstract

INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed of a single task, while the multi-tenancy of INFaaS has not been explored yet. As the demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is not cost-effective, while merely sharing these single-task optimized DNN accelerators in a time-division multiplexing way could lead to poor isolation and high-performance loss for INFaaS. On the other hand, current cloud-based DNN accelerators have excessive compilation overhead, especially when scaling out to multi-FPGA systems for multi-tenant sharing, leading to unacceptable compilation costs for both offline deployment and online reconfiguration. Therefore, it is far from providing efficient and flexible FPGA virtualization for public and private cloud scenarios. Aiming to solve these problems, we propose a unified virtualization framework for general-purpose deep neural networks in the cloud, enabling multi-tenant sharing for both the Convolution Neural Network (CNN), and the Recurrent Neural Network (RNN) accelerators on a single FPGA. The isolation is enabled by introducing a two-level instruction dispatch module and a multi-core based hardware resources pool. Such designs provide isolated and runtime-programmable hardware resources, which further leads to performance isolation for multi-tenant sharing. On the other hand, to overcome the heavy re-compilation overheads, a tiling-based instruction frame package design and a two-stage static-dynamic compilation, are proposed. Only the lightweight runtime information is re-compiled with ∼1 ms overhead, thus guaranteeing the private cloud’s performance. Finally, the extensive experimental results show that the proposed virtualized solutions achieve up to 3.12× and 6.18× higher throughput in the private cloud compared with the static CNN and RNN baseline designs, respectively.

查看原文本刊更多论文

面向云端通用深度神经网络的统一FPGA虚拟化框架

推理即服务(INFaaS)已经成为云中的主要工作负载。然而，现有的基于fpga的深度神经网络(DNN)加速器主要针对单个任务的最快速度进行优化，而INFaaS的多租户尚未探索。随着对INFaaS的需求不断增长，简单地增加基于fpga的DNN加速器的数量并不具有成本效益，而仅仅以分时复用的方式共享这些单任务优化的DNN加速器可能导致INFaaS的隔离性差和高性能损失。另一方面，当前基于云的DNN加速器具有过多的编译开销，特别是当扩展到多fpga系统以进行多租户共享时，导致离线部署和在线重新配置的编译成本都是不可接受的。因此，还远远不能为公有云和私有云场景提供高效、灵活的FPGA虚拟化。为了解决这些问题，我们为云中的通用深度神经网络提出了一个统一的虚拟化框架，使卷积神经网络(CNN)和循环神经网络(RNN)加速器在单个FPGA上实现多租户共享。隔离是通过引入两级指令分派模块和基于多核的硬件资源池来实现的。这种设计提供了隔离的和运行时可编程的硬件资源，这进一步导致了多租户共享的性能隔离。另一方面，为了克服重编译的繁重开销，提出了一种基于平铺的指令框架包设计和两阶段静态动态编译。只有轻量级运行时信息被重新编译，大约1毫秒的开销，从而保证了私有云的性能。最后，广泛的实验结果表明，与静态CNN和RNN基线设计相比，所提出的虚拟化解决方案在私有云中的吞吐量分别提高了3.12倍和6.18倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Reconfigurable Technology and Systems (TRETS)

自引率

0.00%

发文量