FPGA-based CNN accelerator using Convolutional Processing Element to reduce idle states

IF 4.1 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture Pub Date : 2025-06-06 DOI:10.1016/j.sysarc.2025.103468

Mohammad Dehnavi, Aran Ghasemi, Bijan Alizadeh

Journal of Systems Architecture, Volume 167, Article 103468. https://www.sciencedirect.com/science/article/pii/S1383762125001407

Citations: 0

Abstract

Object detection has long been a significant challenge in machine vision systems. Various hardware accelerators have been used to increase inference speed, and most of them aim primarily to minimize idle states in DSP blocks. This paper proposes a new architecture based on Convolutional Processing Elements (CPEs), in which weights are stored and circularly shifted in an internal CPE buffer and then used to generate output feature maps. Increasing data reuse inside the CPEs and decreasing external memory accesses in this way reduces DSP idle states. The number of CPEs used to accelerate a Convolutional Neural Network (CNN) depends on the required speed and the available hardware resources; configurations of 16, 32, 64, 128, and 256 CPEs can be used. To demonstrate its effectiveness, the architecture is applied to the YOLOv3-Tiny object detection CNN. Experimental results show that the proposed architecture with 128 CPE cores runs at 62.8 frames per second on a Xilinx XCKU060 FPGA at a working frequency of 200 MHz, using 16-bit fixed-point representation. This configuration incurs only a 1% drop in mAP while using 43.2K LUTs, 94.4K FFs, 26.73 Mbits of RAM, and 1364 DSPs. Furthermore, the number of external memory chips is reduced by 67% compared with state-of-the-art systems.
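The abstract does not spell out the CPE dataflow, so the following is only a minimal software sketch of the general idea it names: weights are loaded once into a circular buffer and rotated as they are consumed, so every output value reuses them without another external fetch. The names `ToyCPE` and `conv1d`, the 1-D sliding window, and the read/MAC counters are all illustrative assumptions, not the authors' design.

```python
from collections import deque


class ToyCPE:
    """Hypothetical toy model of one processing element (not the paper's RTL)."""

    def __init__(self, weights):
        # Assumption: weights are fetched from external memory exactly once.
        self.buf = deque(weights)
        self.external_weight_reads = len(weights)
        self.macs = 0

    def dot(self, window):
        """Multiply-accumulate one input window against the buffered weights."""
        acc = 0
        for x in window:
            acc += x * self.buf[0]
            # Circular shift: the next weight moves to the head of the buffer.
            self.buf.rotate(-1)
            self.macs += 1
        # After len(weights) shifts the buffer is back in its original order,
        # ready for the next window without reloading anything.
        return acc


def conv1d(cpe, signal):
    """Slide the weight buffer over a 1-D signal (CNN-style, no kernel flip)."""
    k = len(cpe.buf)
    return [cpe.dot(signal[i:i + k]) for i in range(len(signal) - k + 1)]


cpe = ToyCPE([1, 2, 3])
outputs = conv1d(cpe, [1, 0, 2, 1, 3])
reuse = cpe.macs / cpe.external_weight_reads  # MACs per external weight read
```

In this toy model the reuse ratio grows with the number of output positions, which is the effect the abstract attributes to the internal CPE buffer: more arithmetic per external memory access, hence fewer cycles in which a multiplier waits for data.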


Source journal

Journal of Systems Architecture (Engineering & Technology: Computer Hardware)

CiteScore: 8.70

Self-citation rate: 15.60%

Articles published: 226

Review time: 46 days

About the journal: The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, and parallel and distributed architectures, as well as additional subjects in the computer and system architecture area, fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software. Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components with an emphasis on software are also relevant here.