An Efficient Sparse CNN Inference Accelerator With Balanced Intra- and Inter-PE Workload
Jianbo Guo; Tongqing Xu; Zhenyang Wu; Hao Xiao
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 5, pp. 1278-1291
DOI: 10.1109/TVLSI.2024.3515217
Published: 2024-12-18
https://ieeexplore.ieee.org/document/10806633/
Sparse convolutional neural networks (SCNNs), which prune trivial parameters from the network while maintaining model accuracy, have proved to be an attractive approach to alleviating the heavy computation of convolutional neural networks (CNNs). However, the invalid data resulting from sparse patterns leads to unnecessary and irregular computation workloads, which challenges the efficiency of the underlying hardware accelerators. Therefore, this article proposes an SCNN inference accelerator that handles imbalanced workloads both within and across processing elements (PEs). A valid weight encoding (VWE) scheme is proposed to compress sparse weights into dense ones, alleviating the intra-PE load imbalance. Building on the VWE scheme, a randomized load rearrangement (RLR) method is proposed to dynamically schedule convolution kernels with similar sparsity into the same computation batch, alleviating the inter-PE load imbalance. In addition, to reduce off-chip memory accesses, a recurrent weight-stationary (RWS) dataflow is proposed; it adopts a small-batch, multichannel strategy that stacks data from multiple channels into one off-chip access and computes them simultaneously, thereby enabling efficient reuse of on-chip data. Based on the proposed scheme, an efficient SCNN inference accelerator has been designed and verified on a field-programmable gate array (FPGA). Compared with state-of-the-art works, the design achieves $1.16\times$ to $2.77\times$ higher digital signal processor (DSP) efficiency and $1.75\times$ to $15\times$ higher logic efficiency.
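The abstract describes RLR only at a high level: kernels with similar sparsity are grouped into the same batch so that the PEs processing a batch finish at roughly the same time. The paper's actual scheduling algorithm is not given here; the following is a minimal conceptual sketch, assuming a simple sort-by-sparsity batching policy, with hypothetical function names.

```python
import random

def sparsity(kernel):
    """Fraction of zero weights in a flattened kernel."""
    zeros = sum(1 for w in kernel if w == 0.0)
    return zeros / len(kernel)

def rearrange_into_batches(kernels, batch_size):
    """Sort kernel indices by sparsity, then slice consecutive runs into
    batches, so kernels with similar nonzero counts (and hence similar
    multiply-accumulate workloads) are dispatched to the PEs together."""
    order = sorted(range(len(kernels)), key=lambda i: sparsity(kernels[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Toy example: eight flattened 3x3x16 kernels with varying target sparsity.
rng = random.Random(0)
kernels = []
for s in (0.1, 0.8, 0.3, 0.7, 0.2, 0.9, 0.4, 0.6):
    kernels.append([0.0 if rng.random() < s else rng.gauss(0.0, 1.0)
                    for _ in range(3 * 3 * 16)])

# Every kernel in batch 0 is at most as sparse as any kernel in batch 1,
# so each batch presents a near-uniform workload to the PE array.
batches = rearrange_into_batches(kernels, batch_size=4)
```

Within each batch, the spread of nonzero counts is much smaller than across the whole layer, which is the property the RLR method exploits to keep PEs from idling while waiting for a straggler.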
Journal introduction:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems-integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and system-level qualification. Thus, the coverage of these Transactions focuses on VLSI/ULSI microelectronic systems integration.