A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow

IF 5.2 1区 工程技术 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC
Zhiyuan Zhao;Yihao Chen;Pengcheng Feng;Jixing Li;Gang Chen;Rongxuan Shen;Huaxiang Lu
{"title":"A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow","authors":"Zhiyuan Zhao;Yihao Chen;Pengcheng Feng;Jixing Li;Gang Chen;Rongxuan Shen;Huaxiang Lu","doi":"10.1109/TCSI.2025.3554635","DOIUrl":null,"url":null,"abstract":"FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators focus on single-Computing-Engine (CE) architecture with local optimization. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency due to their layer-by-layer dataflow and unified resource mapping mechanisms. To tackle these issues, a novel multi-CE-based accelerator with balanced dataflow is proposed to efficiently accelerate LWCNN through memory-oriented and computing-oriented optimizations. Firstly, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while maintaining a low cost of on-chip buffer size. Secondly, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving efficient resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology is proposed, based on a performance model, to achieve better performance and scalability. The proposed accelerator is evaluated on Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results demonstrate that the proposed accelerator can save up to 68.3% of on-chip memory size with reduced off-chip memory access compared to the reference design. It achieves an impressive performance of up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 7","pages":"3338-3351"},"PeriodicalIF":5.2000,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10969140/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators focus on single-Computing-Engine (CE) architecture with local optimization. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency due to their layer-by-layer dataflow and unified resource mapping mechanisms. To tackle these issues, a novel multi-CE-based accelerator with balanced dataflow is proposed to efficiently accelerate LWCNN through memory-oriented and computing-oriented optimizations. Firstly, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while maintaining a low cost of on-chip buffer size. Secondly, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving efficient resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology is proposed, based on a performance model, to achieve better performance and scalability. The proposed accelerator is evaluated on Xilinx ZC706 platform using MobileNetV2 and ShuffleNetV2. Implementation results demonstrate that the proposed accelerator can save up to 68.3% of on-chip memory size with reduced off-chip memory access compared to the reference design. It achieves an impressive performance of up to 2092.4 FPS and a state-of-the-art MAC efficiency of up to 94.58%, while maintaining a high DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.
基于均衡数据流的轻量级cnn高吞吐量FPGA加速器
用于轻量级卷积神经网络(lwcnn)的FPGA加速器最近引起了人们的广泛关注。现有的LWCNN加速器大多集中在单计算引擎(CE)架构上进行局部优化。然而,这些设计通常由于其逐层数据流和统一的资源映射机制而遭受高片内/片外内存开销和低计算效率的困扰。为了解决这些问题,提出了一种新的基于多ce的数据流平衡加速器,通过面向内存和面向计算的优化来有效地加速LWCNN。首先,混合ce的流架构旨在最大限度地减少片外存储器访问,同时保持低成本的片上缓冲区大小。其次,在流架构中引入平衡数据流策略,通过改进有效的资源映射和缓解数据拥塞来提高计算效率。在此基础上,提出了一种基于性能模型的资源感知内存和并行分配方法,以获得更好的性能和可扩展性。该加速器在Xilinx ZC706平台上使用MobileNetV2和ShuffleNetV2进行了评估。实现结果表明,与参考设计相比,所提出的加速器可以节省高达68.3%的片上存储器大小,并减少片外存储器访问。它实现了高达2092.4 FPS的令人印象深刻的性能和高达94.58%的最先进的MAC效率,同时保持了95%的DSP利用率,因此显着优于当前的LWCNN加速器。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
IEEE Transactions on Circuits and Systems I: Regular Papers
IEEE Transactions on Circuits and Systems I: Regular Papers 工程技术-工程:电子与电气
CiteScore
9.80
自引率
11.80%
发文量
441
审稿时长
2 months
期刊介绍: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes: - Circuits: Analog, Digital and Mixed Signal Circuits and Systems - Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic - Circuits and Systems, Power Electronics and Systems - Software for Analog-and-Logic Circuits and Systems - Control aspects of Circuits and Systems.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信