Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network

Impact Factor: 3.2 · JCR Q1 (Computer Science)
Hsing-Hung Chou, Ching-Te Chiu, Yi-Ping Liao
{"title":"Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network","authors":"Hsing-Hung Chou, Ching-Te Chiu, Yi-Ping Liao","doi":"10.1017/ATSIP.2021.16","DOIUrl":null,"url":null,"abstract":"Deep neural networks (DNN) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when there are huge parameters and high level of computation associated with a DNN model, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that can be split into three parts. First, we propose a cross-layer matrix to extract more features from the teacher's model. Second, we adopt Kullback Leibler (KL) Divergence in an offline environment to make the student model find a wider robust minimum. Finally, we propose the offline ensemble pre-trained teachers to teach a student model. To address dimension mismatch between teacher and student models, we adopt a $1\\times 1$ convolution and two-stage knowledge distillation to release this constraint. We conducted experiments with VGG and ResNet models, using the CIFAR-100 dataset. With VGG-11 as the teacher's model and VGG-6 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.57% with a $2.08\\times$ compression rate and 3.5x computation rate. With ResNet-32 as the teacher's model and ResNet-8 as the student's model, experimental results showed that Top-1 accuracy increased by 4.38% with a $6.11\\times$ compression rate and $5.27\\times$ computation rate. In addition, we conducted experiments using the ImageNet$64\\times 64$ dataset. With MobileNet-16 as the teacher's model and MobileNet-9 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.98% with a $1.59\\times$ compression rate and $2.05\\times$ computation rate.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2021-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"APSIPA Transactions on Signal and Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/ATSIP.2021.16","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 1

Abstract

Deep neural networks (DNNs) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when a DNN model has a huge number of parameters and a high computational cost, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that can be split into three parts. First, we propose a cross-layer matrix to extract more features from the teacher model. Second, we adopt Kullback-Leibler (KL) divergence in an offline environment to help the student model find a wider, more robust minimum. Finally, we propose an offline ensemble of pre-trained teachers to teach the student model. To address the dimension mismatch between teacher and student models, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to relax this constraint. We conducted experiments with VGG and ResNet models on the CIFAR-100 dataset. With VGG-11 as the teacher model and VGG-6 as the student model, the Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and a $3.5\times$ computation rate. With ResNet-32 as the teacher model and ResNet-8 as the student model, the Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and a $5.27\times$ computation rate. In addition, we conducted experiments on the ImageNet $64\times 64$ dataset. With MobileNet-16 as the teacher model and MobileNet-9 as the student model, the Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and a $2.05\times$ computation rate.
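The mechanics named in the abstract are standard enough to sketch. The code below is a minimal PyTorch sketch, not the paper's implementation: it shows a temperature-scaled KL-divergence loss on softened logits, an offline ensemble that averages the logits of several pre-trained teachers, and a $1\times 1$ convolution adapter that projects student feature maps to the teacher's channel count so a cross-layer feature loss can be computed despite the dimension mismatch. The temperature and the loss weights alpha and beta are hypothetical defaults, and the feature-matching term here is a plain MSE stand-in for the paper's cross-layer matrix.

```python
# Minimal sketch of KL-based knowledge distillation with a 1x1-conv adapter
# and an offline teacher ensemble. Hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """Temperature-scaled KL divergence between softened class distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def ensemble_teacher_logits(teacher_logits_list):
    """Offline ensemble: average the logits of several frozen, pre-trained teachers."""
    return torch.stack(teacher_logits_list, dim=0).mean(dim=0)


class ChannelAdapter(nn.Module):
    """1x1 convolution projecting student feature maps to the teacher's channel
    count, relaxing the dimension mismatch for cross-layer feature losses."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels,
                              kernel_size=1, bias=False)

    def forward(self, student_feat):
        return self.proj(student_feat)


def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                      adapter, labels, alpha=0.5, beta=0.1, temperature=4.0):
    """Hard-label cross-entropy + logit-level KL + cross-layer feature matching."""
    ce = F.cross_entropy(student_logits, labels)
    kl = kd_kl_loss(student_logits, teacher_logits, temperature)
    feat = F.mse_loss(adapter(student_feat), teacher_feat)
    return (1.0 - alpha) * ce + alpha * kl + beta * feat
```

In training, the ensemble logits would be computed once per batch from the frozen teachers, and gradients would flow only through the student and the adapter; how the two distillation stages are scheduled is specific to the paper.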
Source journal
APSIPA Transactions on Signal and Information Processing (Engineering, Electrical & Electronic)
CiteScore: 8.60
Self-citation rate: 6.20%
Articles published: 30
Review time: 40 weeks