BE-NPU: A Bandwidth-Efficient Neural Processing Unit With Adaptive Processing Schemes for Reduced Off-Chip Bandwidth Demand

Impact Factor 3.8 · CAS Zone 2 (Computer Science) · JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Yichuan Bai, Xiaopeng Zhang, Qian Wang, Yaqing Li, Yuan Du, Li Du
{"title":"BE-NPU: A Bandwidth-Efficient Neural Processing Unit With Adaptive Processing Schemes for Reduced Off-Chip Bandwidth Demand","authors":"Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du","doi":"10.1109/TC.2025.3558579","DOIUrl":null,"url":null,"abstract":"Existing neural processing units (NPUs) mainly focus on the optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission usually keeps NPUs waiting during CNN inference, causing up to 38.4GB/s off-chip bandwidth (OCB) demand for mobile AI devices. And none of the previous benchmarks quantitatively evaluate the bandwidth efficiency of different NPU architectures. In addition, CNNs exhibit distinct characteristics of off-chip data transmission when applied to different fields, and it has become a challenging task for NPUs to support different CNNs efficiently with reasonable OCB demand. To address the aforementioned issues, this paper proposes the Bandwidth-Peak Performance Ratio for n percentages of ideal frame rate (BPPR-n%) to demonstrate the normalized OCB demand of different NPU architectures. A bandwidth-efficient NPU (BE-NPU) is introduced with adaptive processing schemes to reduce the OCB demand during inference of different CNNs. The adaptive processing schemes include both instruction-level and thread-level schemes. For the instruction-level scheme, decoupled execute/access is introduced into depth-first (DF) and layer-first (LF) schemes to improve the concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. For the thread-level scheme, DF and LF threads are hybridly processed to further improve overall NPU efficiency. Compared with state-of-the-art works, BE-NPU achieves 48.1%∼80.6% reduction of BPPR-80% and 67.0%∼95.1% reduction of BPPR-95%. The proposed architecture is synthesized with TSMC 28nm technology node. BE-NPU utilizes 14.3% additional logic gates compared with baseline implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2376-2388"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10955413/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

Abstract

Existing neural processing units (NPUs) mainly focus on optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission usually keeps NPUs waiting during CNN inference, driving off-chip bandwidth (OCB) demand as high as 38.4 GB/s for mobile AI devices. Moreover, none of the previous benchmarks quantitatively evaluates the bandwidth efficiency of different NPU architectures. In addition, CNNs exhibit distinct off-chip data-transmission characteristics when applied to different fields, so supporting different CNNs efficiently with reasonable OCB demand has become a challenging task for NPUs. To address these issues, this paper proposes the Bandwidth-Peak Performance Ratio for n percent of the ideal frame rate (BPPR-n%) to quantify the normalized OCB demand of different NPU architectures. A bandwidth-efficient NPU (BE-NPU) is introduced with adaptive processing schemes that reduce the OCB demand during inference of different CNNs. The adaptive processing schemes operate at both the instruction level and the thread level. At the instruction level, decoupled execute/access is introduced into the depth-first (DF) and layer-first (LF) schemes to improve concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. At the thread level, DF and LF threads are processed in a hybrid manner to further improve overall NPU efficiency. Compared with state-of-the-art works, BE-NPU achieves a 48.1% to 80.6% reduction in BPPR-80% and a 67.0% to 95.1% reduction in BPPR-95%. The proposed architecture is synthesized at the TSMC 28 nm technology node and uses 14.3% more logic gates than the baseline implementation.
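The abstract names the BPPR-n% metric without giving its formal definition. A minimal sketch of one plausible formalization, where BW_{n%} (the smallest off-chip bandwidth at which the NPU sustains n% of its ideal, bandwidth-unconstrained frame rate) and P_peak (peak MAC throughput) are symbols introduced here for illustration, not taken from the paper:

```latex
% Assumed formalization of BPPR-n% (illustrative, not from the paper):
% the minimum bandwidth needed to reach n% of the ideal frame rate,
% normalized by the NPU's peak compute throughput.
\mathrm{BPPR}\text{-}n\% \;=\; \frac{BW_{n\%}}{P_{\mathrm{peak}}},
\qquad
BW_{n\%} \;=\; \min\bigl\{\, BW \;:\; \mathrm{FPS}(BW) \,\ge\, \tfrac{n}{100}\,\mathrm{FPS}_{\mathrm{ideal}} \,\bigr\}
```

Under this reading, a lower BPPR-n% means an architecture needs less off-chip bandwidth per unit of peak compute to get within n% of its ideal frame rate, which is consistent with the direction of the reported reductions.

The instruction-level scheme hinges on overlapping CAL and DMA instructions. The toy timeline model below, written from scratch for illustration (it is not BE-NPU's scheduler, and the per-layer timings are made up), shows why decoupling execute from access helps: prefetching the next layer's data during the current layer's computation hides whichever of the two phases is shorter.

```python
# Toy timeline model contrasting serialized CAL/DMA execution with a
# decoupled execute/access schedule (illustrative only; not BE-NPU's
# actual instruction scheme).

def serialized_time(layers):
    """Each layer's computation waits for its own DMA to finish."""
    return sum(dma + cal for dma, cal in layers)

def decoupled_time(layers):
    """DMA for layer i+1 overlaps with computation (CAL) for layer i."""
    total = layers[0][0]  # the first layer's DMA cannot be hidden
    for i, (_, cal) in enumerate(layers):
        next_dma = layers[i + 1][0] if i + 1 < len(layers) else 0.0
        total += max(cal, next_dma)  # overlapped stages cost the longer one
    return total

# (dma_ms, cal_ms) per layer -- made-up numbers for illustration.
layers = [(4.0, 2.0), (3.0, 5.0), (6.0, 4.0), (2.0, 3.0)]
print("serialized:", serialized_time(layers), "ms")  # 29.0 ms
print("decoupled :", decoupled_time(layers), "ms")   # 20.0 ms
```

In this toy run the decoupled schedule cuts end-to-end latency from 29 ms to 20 ms; the same overlap is what lets an NPU tolerate a lower off-chip bandwidth before its CAL instructions start stalling.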
Source Journal
IEEE Transactions on Computers
Category: Engineering Technology - Engineering: Electronic & Electrical
CiteScore: 6.60
Self-citation rate: 5.40%
Articles per year: 199
Review time: 6.0 months
Journal introduction: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.