xDNN: Inference for Deep Convolutional Neural Networks

P. D'Alberto, Victor Wu, A. Ng, Rahul Nimaiyar, Elliott Delaye, Ashish Sirasao
{"title":"xDNN: Inference for Deep Convolutional Neural Networks","authors":"P. D'Alberto, Victor Wu, A. Ng, Rahul Nimaiyar, Elliott Delaye, Ashish Sirasao","doi":"10.1145/3473334","DOIUrl":null,"url":null,"abstract":"We present xDNN, an end-to-end system for deep-learning inference based on a family of specialized hardware processors synthesized on Field-Programmable Gate Array (FPGAs) and Convolution Neural Networks (CNN). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and a parametric function of the number of multiply-accumulate units, on-chip memory hierarchy, and numerical precision. The design can produce a scale-down processor for embedded devices, replicated to produce more cores for larger devices, or resized to optimize efficiency. On Xilinx Virtex Ultrascale+ VU13P FPGA, we achieve 800 MHz that is close to the Digital Signal Processing maximum frequency and above 80% efficiency of on-chip compute resources. On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (i.e., from 224× 224 to 2048× 1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and Tensorflow), optimizes them, generates codes, and provides performance estimates. The compiler combines quantization information from the native environment and optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools partitioning a CNN into subgraphs for the division of work to CPU cores and FPGAs. Notice that the software will not change when or if the FPGA design becomes an ASIC, making our work vertical and not just a proof-of-concept FPGA project. We show experimental results for accuracy, latency, and power for several networks: In summary, we can achieve up to 4 times higher throughput, 3 times better power efficiency than the GPUs, and up to 20 times higher throughput than the latest CPUs. To our knowledge, we provide solutions faster than any previous FPGA-based solutions and comparable to any other top-of-the-shelves solutions.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3473334","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

We present xDNN, an end-to-end system for deep-learning inference built on a family of specialized hardware processors synthesized on Field-Programmable Gate Arrays (FPGAs) and on Convolutional Neural Networks (CNNs). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and is a parametric function of the number of multiply-accumulate units, the on-chip memory hierarchy, and the numerical precision. The design can be scaled down to a processor for embedded devices, replicated to provide more cores for larger devices, or resized to optimize efficiency. On a Xilinx Virtex UltraScale+ VU13P FPGA, we achieve 800 MHz, close to the maximum Digital Signal Processing (DSP) frequency, with above 80% efficiency of the on-chip compute resources. On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (i.e., from 224×224 to 2048×1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and TensorFlow), optimizes them, generates code, and provides performance estimates. The compiler combines quantization information from the native environment with optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools that partition a CNN into subgraphs for the division of work between CPU cores and FPGAs. Note that the software will not change when or if the FPGA design becomes an ASIC, making our work a vertical solution and not just a proof-of-concept FPGA project. We show experimental results for accuracy, latency, and power for several networks. In summary, we achieve up to 4 times higher throughput and 3 times better power efficiency than GPUs, and up to 20 times higher throughput than the latest CPUs. To our knowledge, our solutions are faster than any previous FPGA-based solution and comparable to any other off-the-shelf solution.
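
As an illustration of the CPU/FPGA subgraph partitioning the abstract describes, the sketch below shows how a compiler pass might split a topologically sorted CNN graph into maximal runs of FPGA-supported operators, with everything else falling back to the host CPU. This is a minimal sketch under stated assumptions: the operator set, the `Node` class, and `partition_graph` are hypothetical names used only for this example and are not the xDNN toolchain's actual API.

```python
# Hypothetical sketch of CPU/FPGA graph partitioning for CNN inference.
# Names and the supported-operator set are illustrative, not the xDNN API.

from dataclasses import dataclass, field
from typing import List

# Operators assumed to be executable on the FPGA processor; anything else
# is assigned to the host CPU.
FPGA_SUPPORTED = {"conv2d", "relu", "maxpool", "eltwise_add", "batchnorm"}

@dataclass
class Node:
    name: str
    op: str
    inputs: List[str] = field(default_factory=list)

def partition_graph(nodes: List[Node]):
    """Split a topologically sorted graph into maximal runs of
    FPGA-supported operators; an unsupported operator closes the
    current run and starts a CPU subgraph."""
    subgraphs, current, current_target = [], [], None
    for node in nodes:
        target = "fpga" if node.op in FPGA_SUPPORTED else "cpu"
        if current and target != current_target:
            subgraphs.append((current_target, current))
            current = []
        current_target = target
        current.append(node)
    if current:
        subgraphs.append((current_target, current))
    return subgraphs

if __name__ == "__main__":
    # Tiny fragment: conv -> batchnorm -> relu run on the FPGA, softmax on the CPU.
    graph = [
        Node("conv1", "conv2d", ["data"]),
        Node("bn1", "batchnorm", ["conv1"]),
        Node("relu1", "relu", ["bn1"]),
        Node("prob", "softmax", ["relu1"]),
    ]
    for target, subgraph in partition_graph(graph):
        print(target, [n.name for n in subgraph])
```

In a real flow, the FPGA subgraphs would then be compiled to the processor's instruction stream and the CPU subgraphs left to the native framework; the greedy run-splitting shown here is only the simplest possible partitioning policy.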