Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du
{"title":"BE-NPU:一种带宽高效的神经处理单元,具有自适应处理方案,可减少片外带宽需求","authors":"Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du","doi":"10.1109/TC.2025.3558579","DOIUrl":null,"url":null,"abstract":"Existing neural processing units (NPUs) mainly focus on the optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission usually keeps NPUs waiting during CNN inference, causing up to 38.4GB/s off-chip bandwidth (OCB) demand for mobile AI devices. And none of the previous benchmarks quantitatively evaluate the bandwidth efficiency of different NPU architectures. In addition, CNNs exhibit distinct characteristics of off-chip data transmission when applied to different fields, and it has become a challenging task for NPUs to support different CNNs efficiently with reasonable OCB demand. To address the aforementioned issues, this paper proposes the Bandwidth-Peak Performance Ratio for n percentages of ideal frame rate (BPPR-n%) to demonstrate the normalized OCB demand of different NPU architectures. A bandwidth-efficient NPU (BE-NPU) is introduced with adaptive processing schemes to reduce the OCB demand during inference of different CNNs. The adaptive processing schemes include both instruction-level and thread-level schemes. For the instruction-level scheme, decoupled execute/access is introduced into depth-first (DF) and layer-first (LF) schemes to improve the concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. For the thread-level scheme, DF and LF threads are hybridly processed to further improve overall NPU efficiency. Compared with state-of-the-art works, BE-NPU achieves 48.1%∼80.6% reduction of BPPR-80% and 67.0%∼95.1% reduction of BPPR-95%. The proposed architecture is synthesized with TSMC 28nm technology node. BE-NPU utilizes 14.3% additional logic gates compared with baseline implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2376-2388"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BE-NPU: A Bandwidth-Efficient Neural Processing Unit With Adaptive Processing Schemes for Reduced Off-Chip Bandwidth Demand\",\"authors\":\"Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du\",\"doi\":\"10.1109/TC.2025.3558579\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Existing neural processing units (NPUs) mainly focus on the optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission usually keeps NPUs waiting during CNN inference, causing up to 38.4GB/s off-chip bandwidth (OCB) demand for mobile AI devices. And none of the previous benchmarks quantitatively evaluate the bandwidth efficiency of different NPU architectures. In addition, CNNs exhibit distinct characteristics of off-chip data transmission when applied to different fields, and it has become a challenging task for NPUs to support different CNNs efficiently with reasonable OCB demand. To address the aforementioned issues, this paper proposes the Bandwidth-Peak Performance Ratio for n percentages of ideal frame rate (BPPR-n%) to demonstrate the normalized OCB demand of different NPU architectures. 
A bandwidth-efficient NPU (BE-NPU) is introduced with adaptive processing schemes to reduce the OCB demand during inference of different CNNs. The adaptive processing schemes include both instruction-level and thread-level schemes. For the instruction-level scheme, decoupled execute/access is introduced into depth-first (DF) and layer-first (LF) schemes to improve the concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. For the thread-level scheme, DF and LF threads are hybridly processed to further improve overall NPU efficiency. Compared with state-of-the-art works, BE-NPU achieves 48.1%∼80.6% reduction of BPPR-80% and 67.0%∼95.1% reduction of BPPR-95%. The proposed architecture is synthesized with TSMC 28nm technology node. BE-NPU utilizes 14.3% additional logic gates compared with baseline implementation.\",\"PeriodicalId\":13087,\"journal\":{\"name\":\"IEEE Transactions on Computers\",\"volume\":\"74 7\",\"pages\":\"2376-2388\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Computers\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10955413/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10955413/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
BE-NPU: A Bandwidth-Efficient Neural Processing Unit With Adaptive Processing Schemes for Reduced Off-Chip Bandwidth Demand
Existing neural processing units (NPUs) focus mainly on optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission often leaves NPUs waiting during CNN inference, driving off-chip bandwidth (OCB) demand as high as 38.4 GB/s on mobile AI devices, and no previous benchmark quantitatively evaluates the bandwidth efficiency of different NPU architectures. In addition, CNNs applied in different fields exhibit distinct off-chip data-transmission characteristics, so supporting different CNNs efficiently with reasonable OCB demand has become a challenging task for NPUs. To address these issues, this paper proposes the Bandwidth-Peak Performance Ratio at n percent of the ideal frame rate (BPPR-n%) to quantify the normalized OCB demand of different NPU architectures, and introduces a bandwidth-efficient NPU (BE-NPU) with adaptive processing schemes that reduce the OCB demand during inference of different CNNs. The adaptive processing schemes operate at both the instruction level and the thread level. At the instruction level, decoupled execute/access is introduced into the depth-first (DF) and layer-first (LF) schemes to improve concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. At the thread level, DF and LF threads are processed in a hybrid fashion to further improve overall NPU efficiency. Compared with state-of-the-art designs, BE-NPU reduces BPPR-80% by 48.1% to 80.6% and BPPR-95% by 67.0% to 95.1%. Synthesized in a TSMC 28 nm technology node, BE-NPU uses 14.3% more logic gates than the baseline implementation.
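The abstract names the BPPR-n% metric but does not give its exact formula. One plausible reading, stated here purely as an assumption, is the ratio of the off-chip bandwidth needed to sustain n% of the ideal (compute-bound) frame rate to the NPU's peak performance, so that a lower value means a more bandwidth-efficient architecture:

```latex
% Hedged reconstruction of BPPR-n% (the exact definition is not in the
% abstract; B_n and P_peak are assumed quantities, not the paper's notation).
% B_n    : minimum off-chip bandwidth (e.g., GB/s) at which the NPU still
%          reaches n% of its ideal, compute-bound frame rate
% P_peak : peak performance of the MAC array (e.g., TOPS)
\[
  \mathrm{BPPR}\text{-}n\% = \frac{B_{n}}{P_{\mathrm{peak}}}
  \qquad \text{(lower is more bandwidth-efficient)}
\]
```

The instruction-level scheme rests on decoupling execute (CAL) from access (DMA) so that the two proceed concurrently. The Python sketch below illustrates the generic double-buffering pattern behind such decoupling, not the paper's actual microarchitecture; dma_load, mac_compute, and run_layer are hypothetical stand-ins for DMA instructions, CAL instructions, and the NPU's instruction scheduler.

```python
# Minimal sketch (not the paper's implementation) of decoupled execute/access:
# the DMA "access" for tile i+1 is issued while the MAC "execute" for tile i
# runs, so compute never stalls on a transfer that could have been prefetched.

from concurrent.futures import ThreadPoolExecutor

def dma_load(tile):
    """Stand-in for a DMA (access) instruction: an off-chip read of one tile."""
    return [float(x) for x in tile]

def mac_compute(data):
    """Stand-in for a CAL (execute) instruction: MAC-array work on one tile."""
    return sum(data)

def run_layer(tiles):
    """Double-buffered loop: the DMA for tile i+1 overlaps the MACs for tile i."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as dma_engine:  # models the DMA unit
        pending = dma_engine.submit(dma_load, tiles[0])    # prefetch first tile
        for i in range(len(tiles)):
            data = pending.result()                        # wait only if DMA is late
            if i + 1 < len(tiles):
                pending = dma_engine.submit(dma_load, tiles[i + 1])  # next access
            outputs.append(mac_compute(data))              # execute overlaps that DMA
    return outputs

print(run_layer([[1, 2], [3, 4], [5, 6]]))                 # -> [3.0, 7.0, 11.0]
```

With this overlap, each tile's DMA latency hides behind the previous tile's MAC work, which is the CAL/DMA concurrency that the instruction-level scheme targets.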
About the journal:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.