Zhiyuan Zhao;Yihao Chen;Pengcheng Feng;Jixing Li;Gang Chen;Rongxuan Shen;Huaxiang Lu
{"title":"A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow","authors":"Zhiyuan Zhao;Yihao Chen;Pengcheng Feng;Jixing Li;Gang Chen;Rongxuan Shen;Huaxiang Lu","doi":"10.1109/TCSI.2025.3554635","DOIUrl":null,"url":null,"abstract":"FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators focus on single-Computing-Engine (CE) architecture with local optimization. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency due to their layer-by-layer dataflow and unified resource mapping mechanisms. To tackle these issues, a novel multi-CE-based accelerator with balanced dataflow is proposed to efficiently accelerate LWCNN through memory-oriented and computing-oriented optimizations. Firstly, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while maintaining a low cost of on-chip buffer size. Secondly, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving efficient resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology is proposed, based on a performance model, to achieve better performance and scalability. The proposed accelerator is evaluated on Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results demonstrate that the proposed accelerator can save up to 68.3% of on-chip memory size with reduced off-chip memory access compared to the reference design. It achieves an impressive performance of up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 7","pages":"3338-3351"},"PeriodicalIF":5.2000,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10969140/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators focus on single-Computing-Engine (CE) architecture with local optimization. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency due to their layer-by-layer dataflow and unified resource mapping mechanisms. To tackle these issues, a novel multi-CE-based accelerator with balanced dataflow is proposed to efficiently accelerate LWCNN through memory-oriented and computing-oriented optimizations. Firstly, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while maintaining a low cost of on-chip buffer size. Secondly, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving efficient resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology is proposed, based on a performance model, to achieve better performance and scalability. The proposed accelerator is evaluated on Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results demonstrate that the proposed accelerator can save up to 68.3% of on-chip memory size with reduced off-chip memory access compared to the reference design. It achieves an impressive performance of up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.
期刊介绍:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes: - Circuits: Analog, Digital and Mixed Signal Circuits and Systems - Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic - Circuits and Systems, Power Electronics and Systems - Software for Analog-and-Logic Circuits and Systems - Control aspects of Circuits and Systems.