A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow

Impact factor 5.2; Zone 1 (Engineering and Technology); JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)

IEEE Transactions on Circuits and Systems I: Regular Papers. Pub Date: 2025-04-17. DOI: 10.1109/TCSI.2025.3554635

Zhiyuan Zhao;Yihao Chen;Pengcheng Feng;Jixing Li;Gang Chen;Rongxuan Shen;Huaxiang Lu

Vol. 72, No. 7, pp. 3338-3351. https://ieeexplore.ieee.org/document/10969140/


Abstract

FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators use a single-computing-engine (CE) architecture with local optimizations. However, these designs typically suffer from high on-chip and off-chip memory overhead and low computational efficiency because of their layer-by-layer dataflow and unified resource mapping. To tackle these issues, a novel multi-CE accelerator with balanced dataflow is proposed to accelerate LWCNNs efficiently through memory-oriented and computing-oriented optimizations. First, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while keeping the on-chip buffer size small. Second, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology, based on a performance model, is proposed to achieve better performance and scalability. The proposed accelerator is evaluated on the Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results show that the proposed accelerator saves up to 68.3% of on-chip memory and reduces off-chip memory access compared to the reference design. It achieves up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.
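The abstract does not spell out its performance model, so the following is only a toy sketch of why balancing parallelism across the stages of a streaming pipeline raises MAC efficiency. Everything in it is an assumption made for illustration: the `Stage` class, the per-stage MAC counts, the MAC-unit budget, and the greedy `balance` heuristic are hypothetical and are not the authors' allocation methodology. MAC efficiency is taken here in the common sense of useful MACs divided by the MACs the allocated units could have performed in the same time.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline stage (e.g. one layer mapped to its own CE). Hypothetical."""
    name: str
    macs: int             # multiply-accumulates this stage performs per frame
    parallelism: int = 1  # MAC units assigned to this stage


def stage_cycles(stage: Stage) -> int:
    # Ceiling division: cycles the stage needs to process one frame.
    return -(-stage.macs // stage.parallelism)


def balance(stages: list[Stage], budget: int) -> None:
    """Greedy heuristic: give each spare MAC unit to the currently slowest stage."""
    if budget < len(stages):
        raise ValueError("need at least one MAC unit per stage")
    for s in stages:
        s.parallelism = 1
    for _ in range(budget - len(stages)):
        slowest = max(stages, key=stage_cycles)
        slowest.parallelism += 1


def mac_efficiency(stages: list[Stage]) -> float:
    # In a streaming pipeline the slowest stage sets the frame interval,
    # so every MAC unit is occupied (busy or idle) for `bottleneck` cycles.
    useful = sum(s.macs for s in stages)
    bottleneck = max(stage_cycles(s) for s in stages)
    peak = sum(s.parallelism for s in stages) * bottleneck
    return useful / peak


# Made-up workload: four stages with unequal MAC counts, eight MAC units total.
workload = [("a", 400), ("b", 200), ("c", 100), ("d", 100)]

uniform = [Stage(n, m, parallelism=2) for n, m in workload]
balanced = [Stage(n, m) for n, m in workload]
balance(balanced, budget=8)

print([s.parallelism for s in balanced])  # [4, 2, 1, 1]
print(mac_efficiency(uniform))            # 0.5
print(mac_efficiency(balanced))           # 1.0
```

With the same eight units, the uniform mapping leaves three stages waiting on stage `a`, while the workload-proportional mapping gives every stage the same cycle count. Real designs also have to respect memory bandwidth, buffer sizes, and legal unrolling factors, which this sketch ignores.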


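Source journal

IEEE Transactions on Circuits and Systems I: Regular Papers (Engineering and Technology - Engineering: Electrical and Electronic)

CiteScore: 9.80
Self-citation rate: 11.80%
Articles published: 441
Review time: 2 months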

Journal description: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems