{"title":"一种支持深度神经网络低延迟推理的40 TOPS单芯片加速器","authors":"Xun He;Tao Cao;Youjiang Liu;Le Zhong;Guoping Xiao;Cong Yu","doi":"10.1109/TCSII.2025.3563062","DOIUrl":null,"url":null,"abstract":"To achieve low latency for edge applications, a single-chip sparse accelerator is proposed, which can conduct deep neural network (DNN) inference only using limited on-chip memory. Private memory is eliminated, and all memories are shared to reduce power and chip area. An adaptive and variable-length compression algorithm is proposed to store sparse DNNs. A weak-constrained pruning algorithm is proposed to resolve load balance issue in kernel level, which can achieve almost the same sparsity as unconstrained pruning schemes (UCP). Based on these works, a low latency inference accelerator is fabricated in 28-nm CMOS with 8256 MACs and 9.4 MB on-chip SRAM, which can achieve a latency of 0.44 ms for YOLO3 tiny. For high-sparsity layers, our chip can achieve <inline-formula> <tex-math>$6.1\\times $ </tex-math></inline-formula> speedup and a throughput of 40 TOPS. With a pruned YOLO model, our accelerator achieves <inline-formula> <tex-math>$6.7\\times $ </tex-math></inline-formula> lower latency and <inline-formula> <tex-math>$21.7\\times $ </tex-math></inline-formula> better energy efficiency than Jetson Orin. A high-speed evaluation platform is built to demonstrate real-time object detection at a throughput of 600 frames per second (fps) with a power of 1.34 W.","PeriodicalId":13101,"journal":{"name":"IEEE Transactions on Circuits and Systems II: Express Briefs","volume":"72 6","pages":"848-852"},"PeriodicalIF":4.0000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A 40 TOPS Single-Chip Accelerator Enabling Low-Latency Inference for Deep Neural Networks\",\"authors\":\"Xun He;Tao Cao;Youjiang Liu;Le Zhong;Guoping Xiao;Cong Yu\",\"doi\":\"10.1109/TCSII.2025.3563062\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"To achieve low latency for edge applications, a single-chip sparse accelerator is proposed, which can conduct deep neural network (DNN) inference only using limited on-chip memory. Private memory is eliminated, and all memories are shared to reduce power and chip area. An adaptive and variable-length compression algorithm is proposed to store sparse DNNs. A weak-constrained pruning algorithm is proposed to resolve load balance issue in kernel level, which can achieve almost the same sparsity as unconstrained pruning schemes (UCP). Based on these works, a low latency inference accelerator is fabricated in 28-nm CMOS with 8256 MACs and 9.4 MB on-chip SRAM, which can achieve a latency of 0.44 ms for YOLO3 tiny. For high-sparsity layers, our chip can achieve <inline-formula> <tex-math>$6.1\\\\times $ </tex-math></inline-formula> speedup and a throughput of 40 TOPS. With a pruned YOLO model, our accelerator achieves <inline-formula> <tex-math>$6.7\\\\times $ </tex-math></inline-formula> lower latency and <inline-formula> <tex-math>$21.7\\\\times $ </tex-math></inline-formula> better energy efficiency than Jetson Orin. 
A high-speed evaluation platform is built to demonstrate real-time object detection at a throughput of 600 frames per second (fps) with a power of 1.34 W.\",\"PeriodicalId\":13101,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems II: Express Briefs\",\"volume\":\"72 6\",\"pages\":\"848-852\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems II: Express Briefs\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10971969/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems II: Express Briefs","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10971969/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
To achieve low latency for edge applications, a single-chip sparse accelerator is proposed that can perform deep neural network (DNN) inference using only limited on-chip memory. Private memories are eliminated, and all memories are shared to reduce power and chip area. An adaptive, variable-length compression algorithm is proposed to store sparse DNNs. A weak-constrained pruning algorithm is proposed to resolve the load-balance issue at the kernel level; it achieves almost the same sparsity as unconstrained pruning (UCP) schemes. Based on these techniques, a low-latency inference accelerator is fabricated in 28-nm CMOS with 8256 MACs and 9.4 MB of on-chip SRAM, achieving a latency of 0.44 ms for YOLO3 tiny. For high-sparsity layers, our chip achieves a $6.1\times$ speedup and a throughput of 40 TOPS. With a pruned YOLO model, our accelerator achieves $6.7\times$ lower latency and $21.7\times$ better energy efficiency than Jetson Orin. A high-speed evaluation platform demonstrates real-time object detection at 600 frames per second (fps) with a power consumption of 1.34 W.
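The abstract names two techniques, a weak-constrained pruning scheme that balances load at the kernel level and an adaptive, variable-length compression format for storing sparse weights, without detailing either. The Python sketch below only illustrates the general idea: magnitude pruning with a bounded per-kernel sparsity deviation, followed by a toy run-length encoding of each pruned kernel. The tolerance parameter, the byte-limited run length, and all function names are assumptions for illustration; this is not the authors' algorithm or the chip's on-chip format.

import numpy as np

def weak_constrained_prune(weights, target_sparsity=0.8, tol=0.05):
    # Illustrative magnitude pruning with a *weak* per-kernel constraint:
    # each kernel's zero count may deviate from the layer target by at most
    # `tol`, keeping per-kernel workloads (and hence MAC-lane load) balanced.
    # A sketch under assumed parameters, not the paper's algorithm.
    out = np.empty_like(weights)
    n = weights[0].size
    lo = int(n * max(0.0, target_sparsity - tol))        # fewest zeros allowed per kernel
    hi = int(n * min(1.0, target_sparsity + tol))        # most zeros allowed per kernel
    thr = np.quantile(np.abs(weights), target_sparsity)  # layer-wide magnitude threshold
    for k, kernel in enumerate(weights):
        flat = kernel.ravel().copy()
        order = np.argsort(np.abs(flat))                 # smallest magnitudes first
        zeros = int((np.abs(flat) <= thr).sum())         # zeros the global threshold would give
        zeros = int(np.clip(zeros, lo, hi))              # enforce the weak per-kernel bound
        flat[order[:zeros]] = 0.0
        out[k] = flat.reshape(kernel.shape)
    return out

def rle_encode(kernel, max_run=255):
    # Toy variable-length encoding of one pruned kernel as (zero_run, value)
    # pairs, splitting runs longer than `max_run` so the run field fits a byte.
    # Stands in for the adaptive compression scheme described in the brief.
    stream, run = [], 0
    for v in kernel.ravel():
        if v == 0.0 and run < max_run:
            run += 1
        else:
            stream.append((run, float(v)))
            run = 0
    if run:
        stream.append((run - 1, 0.0))                    # flush trailing zeros (last one stored as the literal)
    return stream

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = rng.standard_normal((16, 3, 3, 32)).astype(np.float32)  # 16 output kernels
    pruned = weak_constrained_prune(layer, target_sparsity=0.8, tol=0.05)
    density = [(k != 0).mean() for k in pruned]
    print("per-kernel density:", np.round(density, 3))   # roughly equal -> balanced MAC lanes
    print("pairs for kernel 0:", len(rle_encode(pruned[0])), "vs", pruned[0].size, "raw weights")

With the weak bound, every kernel keeps a similar nonzero count, so parallel MAC lanes assigned to different kernels finish at roughly the same time, which is the load-balance property the pruning scheme targets while giving up little of the sparsity an unconstrained scheme would reach. As a side note on the reported platform numbers, 600 fps at 1.34 W works out to roughly 1.34 W / 600 fps, about 2.2 mJ per frame.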
Journal introduction:
TCAS II publishes brief papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
Circuits: Analog, Digital and Mixed Signal Circuits and Systems
Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
Software for Analog-and-Logic Circuits and Systems
Control aspects of Circuits and Systems.