Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du
{"title":"BE-NPU:一种带宽高效的神经处理单元,具有自适应处理方案,可减少片外带宽需求","authors":"Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du","doi":"10.1109/TC.2025.3558579","DOIUrl":null,"url":null,"abstract":"Existing neural processing units (NPUs) mainly focus on the optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission usually keeps NPUs waiting during CNN inference, causing up to 38.4GB/s off-chip bandwidth (OCB) demand for mobile AI devices. And none of the previous benchmarks quantitatively evaluate the bandwidth efficiency of different NPU architectures. In addition, CNNs exhibit distinct characteristics of off-chip data transmission when applied to different fields, and it has become a challenging task for NPUs to support different CNNs efficiently with reasonable OCB demand. To address the aforementioned issues, this paper proposes the Bandwidth-Peak Performance Ratio for n percentages of ideal frame rate (BPPR-n%) to demonstrate the normalized OCB demand of different NPU architectures. A bandwidth-efficient NPU (BE-NPU) is introduced with adaptive processing schemes to reduce the OCB demand during inference of different CNNs. The adaptive processing schemes include both instruction-level and thread-level schemes. For the instruction-level scheme, decoupled execute/access is introduced into depth-first (DF) and layer-first (LF) schemes to improve the concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. For the thread-level scheme, DF and LF threads are hybridly processed to further improve overall NPU efficiency. Compared with state-of-the-art works, BE-NPU achieves 48.1%∼80.6% reduction of BPPR-80% and 67.0%∼95.1% reduction of BPPR-95%. The proposed architecture is synthesized with TSMC 28nm technology node. BE-NPU utilizes 14.3% additional logic gates compared with baseline implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2376-2388"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BE-NPU: A Bandwidth-Efficient Neural Processing Unit With Adaptive Processing Schemes for Reduced Off-Chip Bandwidth Demand\",\"authors\":\"Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du\",\"doi\":\"10.1109/TC.2025.3558579\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Existing neural processing units (NPUs) mainly focus on the optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission usually keeps NPUs waiting during CNN inference, causing up to 38.4GB/s off-chip bandwidth (OCB) demand for mobile AI devices. And none of the previous benchmarks quantitatively evaluate the bandwidth efficiency of different NPU architectures. In addition, CNNs exhibit distinct characteristics of off-chip data transmission when applied to different fields, and it has become a challenging task for NPUs to support different CNNs efficiently with reasonable OCB demand. To address the aforementioned issues, this paper proposes the Bandwidth-Peak Performance Ratio for n percentages of ideal frame rate (BPPR-n%) to demonstrate the normalized OCB demand of different NPU architectures. 
A bandwidth-efficient NPU (BE-NPU) is introduced with adaptive processing schemes to reduce the OCB demand during inference of different CNNs. The adaptive processing schemes include both instruction-level and thread-level schemes. For the instruction-level scheme, decoupled execute/access is introduced into depth-first (DF) and layer-first (LF) schemes to improve the concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. For the thread-level scheme, DF and LF threads are hybridly processed to further improve overall NPU efficiency. Compared with state-of-the-art works, BE-NPU achieves 48.1%∼80.6% reduction of BPPR-80% and 67.0%∼95.1% reduction of BPPR-95%. The proposed architecture is synthesized with TSMC 28nm technology node. BE-NPU utilizes 14.3% additional logic gates compared with baseline implementation.\",\"PeriodicalId\":13087,\"journal\":{\"name\":\"IEEE Transactions on Computers\",\"volume\":\"74 7\",\"pages\":\"2376-2388\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Computers\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10955413/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10955413/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
BE-NPU: A Bandwidth-Efficient Neural Processing Unit With Adaptive Processing Schemes for Reduced Off-Chip Bandwidth Demand
Existing neural processing units (NPUs) focus mainly on optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission often leaves NPUs waiting during CNN inference, driving off-chip bandwidth (OCB) demand as high as 38.4 GB/s on mobile AI devices, and no previous benchmark quantitatively evaluates the bandwidth efficiency of different NPU architectures. In addition, CNNs applied in different fields exhibit distinct off-chip data-transmission characteristics, so supporting different CNNs efficiently with reasonable OCB demand has become a challenging task for NPUs. To address these issues, this paper proposes the Bandwidth-Peak Performance Ratio at n percent of the ideal frame rate (BPPR-n%) to quantify the normalized OCB demand of different NPU architectures, and introduces a bandwidth-efficient NPU (BE-NPU) with adaptive processing schemes that reduce the OCB demand during inference of different CNNs. The adaptive processing schemes operate at both the instruction level and the thread level. At the instruction level, decoupled execute/access is introduced into the depth-first (DF) and layer-first (LF) schemes to improve concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. At the thread level, DF and LF threads are processed in a hybrid fashion to further improve overall NPU efficiency. Compared with state-of-the-art designs, BE-NPU reduces BPPR-80% by 48.1% to 80.6% and BPPR-95% by 67.0% to 95.1%. Synthesized in a TSMC 28 nm technology node, BE-NPU uses 14.3% more logic gates than the baseline implementation.
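The abstract names the BPPR-n% metric but does not give its exact formula. One plausible reading, stated here purely as an assumption, is the ratio of the off-chip bandwidth needed to sustain n% of the ideal (compute-bound) frame rate to the NPU's peak performance, so that a lower value means a more bandwidth-efficient architecture:

```latex
% Hedged reconstruction of BPPR-n% (the exact definition is not in the
% abstract; B_n and P_peak are assumed quantities, not the paper's notation).
% B_n    : minimum off-chip bandwidth (e.g., GB/s) at which the NPU still
%          reaches n% of its ideal, compute-bound frame rate
% P_peak : peak performance of the MAC array (e.g., TOPS)
\[
  \mathrm{BPPR}\text{-}n\% = \frac{B_{n}}{P_{\mathrm{peak}}}
  \qquad \text{(lower is more bandwidth-efficient)}
\]
```

The instruction-level scheme rests on decoupling execute (CAL) from access (DMA) so that the two proceed concurrently. The Python sketch below illustrates the generic double-buffering pattern behind such decoupling, not the paper's actual microarchitecture; dma_load, mac_compute, and run_layer are hypothetical stand-ins for DMA instructions, CAL instructions, and the NPU's instruction scheduler.

```python
# Minimal sketch (not the paper's implementation) of decoupled execute/access:
# the DMA "access" for tile i+1 is issued while the MAC "execute" for tile i
# runs, so compute never stalls on a transfer that could have been prefetched.

from concurrent.futures import ThreadPoolExecutor

def dma_load(tile):
    """Stand-in for a DMA (access) instruction: an off-chip read of one tile."""
    return [float(x) for x in tile]

def mac_compute(data):
    """Stand-in for a CAL (execute) instruction: MAC-array work on one tile."""
    return sum(data)

def run_layer(tiles):
    """Double-buffered loop: the DMA for tile i+1 overlaps the MACs for tile i."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as dma_engine:  # models the DMA unit
        pending = dma_engine.submit(dma_load, tiles[0])    # prefetch first tile
        for i in range(len(tiles)):
            data = pending.result()                        # wait only if DMA is late
            if i + 1 < len(tiles):
                pending = dma_engine.submit(dma_load, tiles[i + 1])  # next access
            outputs.append(mac_compute(data))              # execute overlaps that DMA
    return outputs

print(run_layer([[1, 2], [3, 4], [5, 6]]))                 # -> [3.0, 7.0, 11.0]
```

With this overlap, each tile's DMA latency hides behind the previous tile's MAC work, which is the CAL/DMA concurrency that the instruction-level scheme targets.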
About the journal:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.