A review on hardware accelerators for convolutional neural network-based inference engines: Strategies for performance and energy-efficiency enhancement
IF 1.9 | CAS Zone 4, Computer Science | JCR Q3, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Deepika S., Arunachalam V., Alex Noel Joseph Raj
Citations: 0
Abstract
In time-critical and safety-critical image classification applications, Convolutional Neural Network (CNN)-based Inference Engines (IEs) are preferred, and they are required to be fast, accurate, and cost-effective to meet market demands. Their self-feature-extraction capability relies on millions of parameters and neurons across a stack of layers under restricted processing time. This paper reviews acceleration strategies applied in hardware-based image-classification CNN inference engines, considering three categories: (1) Arithmetic Logic Unit (ALU)-based, (2) dataflow-based, and (3) sparsity-based. With respect to benchmark accuracy, a 16-bit mixed fixed/floating-point format can retain 99 % accuracy and deliver 3.75 times the performance of half-precision floating point in an application-specific CNN model. Feeding 2-dimensional or 3-dimensional data frames to the CNN layers enables data reuse, which reduces memory usage and improves the efficiency of the processor array. Pruning zero/near-zero-valued Input Feature Maps (IFMs) and weights introduces sparsity in the data fed to the different layers; data compression strategies and the skipping of trivial computations (the zero-skipping approach) therefore reduce the complexity handled by the controller. Compared to a dense architecture, there is a benchmark performance improvement of 1.17 times and a 6.2 times gain in power efficiency. Minimizing the complexity of the indexing and load-balancing controller would improve performance further.
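As a minimal sketch of two of these strategy families (not from the paper; the Q8.8 format, the function names, and the sparsity levels are illustrative assumptions), the Python example below shows (a) an ALU-level technique, quantizing operands to a 16-bit fixed-point representation, and (b) the sparsity-level zero-skipping approach, in which trivial products never reach the multiplier:

```python
import numpy as np

# (a) ALU-level: quantize to a 16-bit fixed-point format. Q8.8 is an
# illustrative choice; the reviewed "mixed fixed/floating point" formats
# may allocate bits differently.
def to_q8_8(x):
    """Round to 16-bit signed fixed point with 8 fractional bits."""
    return np.clip(np.round(x * 256), -32768, 32767).astype(np.int16)

# (b) Sparsity-level: skip trivial (zero) products so the multiplier and
# its operand fetches stay idle -- the source of the performance and
# power-efficiency gains reported over dense architectures.
def mac_zero_skipping(ifm, weights):
    acc = 0
    for x, w in zip(ifm, weights):
        if x != 0 and w != 0:       # skip/index decision made by the controller
            acc += int(x) * int(w)  # widened accumulator, as in real MAC units
    return acc

# ReLU activations and magnitude-pruned weights are naturally sparse.
rng = np.random.default_rng(0)
ifm = np.maximum(rng.standard_normal(64), 0)  # roughly half zeros after ReLU
w = rng.standard_normal(64)
w[np.abs(w) < 0.5] = 0                        # near-zero weight pruning

q_ifm, q_w = to_q8_8(ifm), to_q8_8(w)
print("fixed-point MAC:", mac_zero_skipping(q_ifm, q_w) / 256**2)
print("float reference:", float(ifm @ w))
```

In a hardware IE the skip decision is made by indexing and load-balancing logic rather than a software branch, which is why the abstract singles out controller complexity as the remaining bottleneck.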
Journal introduction:
Microprocessors and Microsystems: Embedded Hardware Design (MICPRO) is a journal covering all design and architectural aspects of embedded systems hardware. This includes embedded system hardware platforms ranging from custom hardware, via reconfigurable systems and application-specific processors, to general-purpose embedded processors. Special emphasis is put on novel complex embedded architectures, such as systems on chip (SoC), systems on a programmable/reconfigurable chip (SoPC), and multi-processor systems on a chip (MPSoC), as well as their memory and communication methods and structures, such as network-on-chip (NoC).
Design automation of such systems, including methodologies, techniques, flows, and tools for their design, as well as novel designs of hardware components, falls within the scope of this journal. Novel cyber-physical applications that use embedded systems are also central to this journal. While software is not the main focus of this journal, methods of hardware/software co-design, as well as application restructuring and mapping to embedded hardware platforms, that consider the interplay between software and hardware components with an emphasis on hardware, are also in the journal's scope.