PDD: Pruning Neural Networks During Knowledge Distillation

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Cognitive Computation Pub Date : 2024-08-31 DOI:10.1007/s12559-024-10350-9

Xi Dan, Wenjie Yang, Fuyan Zhang, Yihang Zhou, Zhuojun Yu, Zhen Qiu, Boyuan Zhao, Zeyu Dong, Libo Huang, Chuanguang Yang

{"title":"PDD: Pruning Neural Networks During Knowledge Distillation","authors":"Xi Dan, Wenjie Yang, Fuyan Zhang, Yihang Zhou, Zhuojun Yu, Zhen Qiu, Boyuan Zhao, Zeyu Dong, Libo Huang, Chuanguang Yang","doi":"10.1007/s12559-024-10350-9","DOIUrl":null,"url":null,"abstract":"<p>Although deep neural networks have developed at a high level, the large computational requirement limits the deployment in end devices. To this end, a variety of model compression and acceleration techniques have been developed. Among these, knowledge distillation has emerged as a popular approach that involves training a small student model to mimic the performance of a larger teacher model. However, the student architectures used in existing knowledge distillation are not optimal and always have redundancy, which raises questions about the validity of this assumption in practice. This study aims to investigate this assumption and empirically demonstrate that student models could contain redundancy, which can be removed through pruning without significant performance degradation. Therefore, we propose a novel pruning method to eliminate redundancy in student models. Instead of using traditional post-training pruning methods, we perform pruning during knowledge distillation (<b>PDD</b>) to prevent any loss of important information from the teacher models to the student models. This is achieved by designing a differentiable mask for each convolutional layer, which can dynamically adjust the channels to be pruned based on the loss. Experimental results show that with ResNet20 as the student model and ResNet56 as the teacher model, a 39.53%-FLOPs reduction was achieved by removing 32.77% of parameters, while the top-1 accuracy on CIFAR10 increased by 0.17%. With VGG11 as the student model and VGG16 as the teacher model, a 74.96%-FLOPs reduction was achieved by removing 76.43% of parameters, with only a loss of 1.34% in the top-1 accuracy on CIFAR10. Our code is available at https://github.com/YihangZhou0424/PDD-Pruning-during-distillation.</p>","PeriodicalId":51243,"journal":{"name":"Cognitive Computation","volume":"18 1","pages":""},"PeriodicalIF":4.3000,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cognitive Computation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s12559-024-10350-9","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Although deep neural networks have developed at a high level, the large computational requirement limits the deployment in end devices. To this end, a variety of model compression and acceleration techniques have been developed. Among these, knowledge distillation has emerged as a popular approach that involves training a small student model to mimic the performance of a larger teacher model. However, the student architectures used in existing knowledge distillation are not optimal and always have redundancy, which raises questions about the validity of this assumption in practice. This study aims to investigate this assumption and empirically demonstrate that student models could contain redundancy, which can be removed through pruning without significant performance degradation. Therefore, we propose a novel pruning method to eliminate redundancy in student models. Instead of using traditional post-training pruning methods, we perform pruning during knowledge distillation (PDD) to prevent any loss of important information from the teacher models to the student models. This is achieved by designing a differentiable mask for each convolutional layer, which can dynamically adjust the channels to be pruned based on the loss. Experimental results show that with ResNet20 as the student model and ResNet56 as the teacher model, a 39.53%-FLOPs reduction was achieved by removing 32.77% of parameters, while the top-1 accuracy on CIFAR10 increased by 0.17%. With VGG11 as the student model and VGG16 as the teacher model, a 74.96%-FLOPs reduction was achieved by removing 76.43% of parameters, with only a loss of 1.34% in the top-1 accuracy on CIFAR10. Our code is available at https://github.com/YihangZhou0424/PDD-Pruning-during-distillation.

Abstract Image

查看原文本刊更多论文

PDD：在知识提炼过程中剪枝神经网络

虽然深度神经网络已经发展到了很高的水平，但其庞大的计算需求限制了其在终端设备中的部署。为此，人们开发了各种模型压缩和加速技术。其中，知识蒸馏已成为一种流行的方法，它包括训练一个小的学生模型来模仿一个大的教师模型的性能。然而，现有知识蒸馏中使用的学生架构并不是最优的，总是存在冗余，这就对这一假设在实践中的有效性提出了质疑。本研究旨在对这一假设进行调查，并通过实证证明学生模型可能包含冗余，而这些冗余可以通过剪枝去除，且不会明显降低性能。因此，我们提出了一种新颖的剪枝方法来消除学生模型中的冗余。我们不使用传统的训练后剪枝方法，而是在知识蒸馏（PDD）过程中进行剪枝，以防止教师模型中的重要信息流失到学生模型中。这是通过为每个卷积层设计一个可微分掩码来实现的，它可以根据损失情况动态调整要剪枝的通道。实验结果表明，以 ResNet20 作为学生模型，以 ResNet56 作为教师模型，通过去除 32.77% 的参数，实现了 39.53%-FLOPs 的缩减，而 CIFAR10 的 top-1 准确率提高了 0.17%。以 VGG11 作为学生模型，以 VGG16 作为教师模型，通过移除 76.43% 的参数，实现了 74.96%-FLOPs 的缩减，而 CIFAR10 的前 1 名准确率仅下降了 1.34%。我们的代码见 https://github.com/YihangZhou0424/PDD-Pruning-during-distillation。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Cognitive Computation COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-NEUROSCIENCES

CiteScore

9.30

自引率

3.70%

发文量

116

审稿时长

>12 weeks

期刊介绍： Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically-inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices and future trends in the emerging discipline of cognitive computation that bridges the gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.