An Efficient Sparse CNN Inference Accelerator With Balanced Intra- and Inter-PE Workload

IF 2.8 | Zone 2, Engineering & Technology | Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems | Pub Date: 2024-12-18 | DOI: 10.1109/TVLSI.2024.3515217

Jianbo Guo;Tongqing Xu;Zhenyang Wu;Hao Xiao

Volume 33, Issue 5, Pages 1278-1291 | https://ieeexplore.ieee.org/document/10806633/

Citations: 0


An Efficient Sparse CNN Inference Accelerator With Balanced Intra- and Inter-PE Workload

Sparse convolutional neural networks (SCNNs), which prune trivial parameters in the network while maintaining model accuracy, have proved to be an attractive approach to alleviating the heavy computation of convolutional neural networks (CNNs). However, the invalid data resulting from sparse patterns leads to unnecessary and irregular computation workload, which challenges the efficiency of the underlying hardware accelerators. Therefore, this article proposes an SCNN inference accelerator that handles workload imbalance both within and across processing elements (PEs). A valid weight encoding (VWE) scheme is proposed to compress sparse weights into dense ones to alleviate the intra-PE load imbalance. Leveraging the VWE scheme, a randomized load rearrangement (RLR) method is proposed to dynamically schedule convolution kernels with similar sparsity into the same computation batch to alleviate the inter-PE load imbalance. In addition, to reduce off-chip memory accesses, a recurrent weight stationary (RWS) dataflow is proposed, which adopts a small-batch, multichannel strategy to stack data from multiple channels within one off-chip access and compute on them simultaneously, thereby enabling efficient reuse of on-chip data. Based on the proposed scheme, an efficient SCNN inference accelerator has been designed and verified on a field-programmable gate array (FPGA). Compared with state-of-the-art works, our design achieves $1.16\times$ to $2.77\times$ higher digital signal processor (DSP) efficiency and $1.75\times$ to $15\times$ higher logic efficiency.
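The abstract does not give the details of VWE or RLR, so the following Python is only an illustrative sketch of the two problems they target, not the authors' method. It assumes a hypothetical cost model in which each PE processes one kernel per batch at one multiply-accumulate per cycle, and a batch finishes when its slowest PE does. `compress_kernel` is a generic nonzero (index, value) compression standing in for the idea of packing sparse weights densely, and `group_by_sparsity` uses a plain sort as a stand-in for scheduling kernels of similar sparsity together. All function names are assumptions made for this example.

```python
def compress_kernel(weights):
    """Keep only nonzero weights, each paired with its flat position."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]


def batch_loads(kernels, batch_size):
    """Cycles per batch: one kernel per PE, batch waits for the slowest PE."""
    loads = [len(compress_kernel(k)) for k in kernels]
    return [max(loads[i:i + batch_size])
            for i in range(0, len(loads), batch_size)]


def group_by_sparsity(kernels):
    """Order kernels by nonzero count so each batch holds similar loads."""
    return sorted(kernels, key=lambda k: len(compress_kernel(k)))


def make_kernel(nnz, size=9):
    """Hypothetical flattened 3x3 kernel with `nnz` nonzero weights."""
    return [1] * nnz + [0] * (size - nnz)


# Eight kernels with very different sparsity, scheduled onto 4 PEs.
kernels = [make_kernel(n) for n in (9, 1, 8, 2, 7, 3, 6, 4)]

unsorted_cycles = sum(batch_loads(kernels, 4))                    # 9 + 7 = 16
grouped_cycles = sum(batch_loads(group_by_sparsity(kernels), 4))  # 4 + 9 = 13
useful_macs = sum(len(compress_kernel(k)) for k in kernels)       # 40

print(unsorted_cycles, grouped_cycles)
print(useful_macs / (unsorted_cycles * 4), useful_macs / (grouped_cycles * 4))
```

Under these assumptions, grouping kernels of similar sparsity cuts the total from 16 to 13 cycles for the same 40 useful multiply-accumulates, which raises PE utilization from 62.5% to about 77%. Skipping zeros inside each kernel is what makes the per-PE load equal to the nonzero count in the first place.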


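Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months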

Journal description: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.