{"title":"利用整体稀疏性对齐在移动设备上高效推断剪枝 CNN 模型","authors":"Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai","doi":"10.1109/TPDS.2024.3462092","DOIUrl":null,"url":null,"abstract":"Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. 
Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2208-2223"},"PeriodicalIF":5.6000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment\",\"authors\":\"Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai\",\"doi\":\"10.1109/TPDS.2024.3462092\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. 
Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.\",\"PeriodicalId\":13257,\"journal\":{\"name\":\"IEEE Transactions on Parallel and Distributed Systems\",\"volume\":\"35 11\",\"pages\":\"2208-2223\"},\"PeriodicalIF\":5.6000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Parallel and Distributed Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10682058/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10682058/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0
Abstract
Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment
Many artificial intelligence applications based on convolutional neural networks are deployed directly on mobile devices to avoid network unavailability and user privacy leakage. However, the sharp growth in the number of model parameters makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compressing models, reducing both parameters and computational operations, but it also introduces irregular sparsity into the network, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations to align the non-zero weight count with the vector width, improving computational efficiency while preserving accuracy. During code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. In addition, convolution computations are reordered for alignment and then mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results with several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency with no accuracy degradation.
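The abstract's core idea, pruning so that each filter's non-zero weight count is a multiple of the hardware vector width, can be illustrated with a toy magnitude-pruning sketch. This is a hypothetical simplification for intuition only: `prune_align` and its parameters are not the paper's actual API, and the real framework couples this step with code generation and a sparse storage format.

```python
import numpy as np

def prune_align(weights, sparsity=0.75, vec_width=4):
    """Magnitude-prune each filter (row) so that its non-zero count
    is a multiple of vec_width, so SIMD lanes are fully utilized.
    Hypothetical sketch, not the MCPruner implementation."""
    out = np.zeros_like(weights)
    for i, w in enumerate(weights):
        flat = w.ravel()
        keep = int(round(flat.size * (1 - sparsity)))
        # Round the kept count up to a multiple of the vector width,
        # clamped to the filter size.
        keep = max(vec_width, ((keep + vec_width - 1) // vec_width) * vec_width)
        keep = min(keep, flat.size)
        # Keep the largest-magnitude weights; zero out the rest.
        idx = np.argsort(np.abs(flat))[-keep:]
        pruned = np.zeros_like(flat)
        pruned[idx] = flat[idx]
        out[i] = pruned.reshape(w.shape)
    return out
```

With a vector width of 4, every filter ends up with a non-zero count divisible by 4, so a vectorized kernel never processes a partially filled lane group, which is the computational alignment the abstract describes.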
About the journal:
IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:
a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing.
b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.
c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.
d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.