NN2FPGA: Optimizing CNN Inference on FPGAs With Binary Integer Programming

IF 2.7 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Pub Date : 2024-11-27 DOI:10.1109/TCAD.2024.3507570

Roberto Bosio;Filippo Minnella;Teodoro Urso;Mario R. Casu;Luciano Lavagno;Mihai T. Lazarescu;Paolo Pasini

Volume 44, Issue 5, pp. 1807-1818 · Journal Article · Full text: https://ieeexplore.ieee.org/document/10769518/

Citations: 0

Abstract

Skip connections have emerged as a key component of modern convolutional neural networks (CNNs) for computer vision tasks, allowing for the creation of more accurate and deeper models by addressing the vanishing gradient problem. However, the existing implementations of field-programmable gate array (FPGA)-based accelerators for ResNets and MobileNetV2 often experience decreased performance and increased computational latency due to the implementation of skip blocks. This article presents a novel framework for developing deep learning models on FPGAs that focuses on skip connections, with a unique approach to reduce buffering overhead. This results in a more efficient utilization of resources in the implementation of the skip layer. The nn2fpga compiler follows a thorough set of high-level synthesis (HLS) design principles and optimization strategies, exploiting in novel ways standard techniques to effectively map skip connection-based networks into static dataflow accelerators. To maximize throughput and efficiently use the available resources, our compiler employs a fast and effective design space exploration method based on a binary integer programming model which accurately assigns FPGA resources to the network layers, to maximize global throughput under resource constraints and then minimize resources for the achieved maximum throughput. Experimental results on the CIFAR-10 and ImageNet datasets demonstrate substantial gains in throughput ($3\times$–$7\times$ over past HLS-based work) for ResNet8, ResNet20, and MobileNetV2 models deployed on various Xilinx FPGA boards. Notably, MobileNetV2 deployed on the ZCU102 achieves a throughput of 2115 frames per second, representing even a 10% speedup over a state-of-the-art highly optimized manual register-transfer level implementation, showing that HLS can actually improve over manual design, thanks to the faster exploration of the design space.
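
The two-stage objective described in the abstract (maximize throughput under a resource budget, then minimize resources at that throughput) can be illustrated with a toy sketch. This is not the nn2fpga formulation: the workloads, the parallelism options, the one-DSP-per-parallel-MAC cost model, and the names `WORKLOAD`, `OPTIONS`, and `explore` are all assumptions made for illustration, and exhaustive enumeration stands in for a real integer-programming solver. Choosing exactly one option per layer plays the role of the binary selection variables.

```python
from itertools import product

# Hypothetical per-layer workloads in multiply-accumulates per frame (illustrative only).
WORKLOAD = [1_000, 4_000, 2_000]
# Hypothetical parallelism options: MACs per clock cycle.
# Assumed cost model: one DSP per parallel MAC.
OPTIONS = [1, 2, 4, 8]


def cycles_per_frame(choice):
    """In a dataflow pipeline the slowest layer sets the frame interval."""
    return max(-(-work // par) for work, par in zip(WORKLOAD, choice))


def explore(dsp_budget):
    """Pick one option per layer; minimize cycles first, then DSP usage.

    Returns ((cycles, dsps), choice), or None if no assignment fits the budget.
    """
    best = None
    for choice in product(OPTIONS, repeat=len(WORKLOAD)):
        dsps = sum(choice)
        if dsps > dsp_budget:
            continue  # violates the resource constraint
        key = (cycles_per_frame(choice), dsps)  # lexicographic objective
        if best is None or key < best[0]:
            best = (key, choice)
    return best
```

With a budget of 16 DSPs, `explore(16)` returns `((500, 14), (2, 8, 4))`: the second stage gives back the two DSPs that would not improve the bottleneck layer. Enumeration grows exponentially with the number of layers, which is why a real design space exploration would hand the same objective to a solver.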

Source journal

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

5.60

Self-citation rate

13.80%

Articles published

500

Review time

7 months

Journal description: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical, and logical design, including planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design, and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.