Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network

Impact Factor: 3.2 · JCR Q1 (Computer Science)
Hsing-Hung Chou, Ching-Te Chiu, Yi-Ping Liao
{"title":"Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network","authors":"Hsing-Hung Chou, Ching-Te Chiu, Yi-Ping Liao","doi":"10.1017/ATSIP.2021.16","DOIUrl":null,"url":null,"abstract":"Deep neural networks (DNN) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when there are huge parameters and high level of computation associated with a DNN model, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that can be split into three parts. First, we propose a cross-layer matrix to extract more features from the teacher's model. Second, we adopt Kullback Leibler (KL) Divergence in an offline environment to make the student model find a wider robust minimum. Finally, we propose the offline ensemble pre-trained teachers to teach a student model. To address dimension mismatch between teacher and student models, we adopt a $1\\times 1$ convolution and two-stage knowledge distillation to release this constraint. We conducted experiments with VGG and ResNet models, using the CIFAR-100 dataset. With VGG-11 as the teacher's model and VGG-6 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.57% with a $2.08\\times$ compression rate and 3.5x computation rate. With ResNet-32 as the teacher's model and ResNet-8 as the student's model, experimental results showed that Top-1 accuracy increased by 4.38% with a $6.11\\times$ compression rate and $5.27\\times$ computation rate. In addition, we conducted experiments using the ImageNet$64\\times 64$ dataset. With MobileNet-16 as the teacher's model and MobileNet-9 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.98% with a $1.59\\times$ compression rate and $2.05\\times$ computation rate.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2021-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"APSIPA Transactions on Signal and Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/ATSIP.2021.16","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 1

Abstract

Deep neural networks (DNNs) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when a DNN model has a huge number of parameters and a high computational cost, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that can be split into three parts. First, we propose a cross-layer matrix to extract more features from the teacher model. Second, we adopt Kullback-Leibler (KL) divergence in an offline environment to help the student model find a wider, more robust minimum. Finally, we propose an offline ensemble of pre-trained teachers to teach the student model. To address the dimension mismatch between teacher and student models, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to relax this constraint. We conducted experiments with VGG and ResNet models on the CIFAR-100 dataset. With VGG-11 as the teacher model and VGG-6 as the student model, the Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and a $3.5\times$ computation rate. With ResNet-32 as the teacher model and ResNet-8 as the student model, the Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and a $5.27\times$ computation rate. In addition, we conducted experiments on the ImageNet $64\times 64$ dataset. With MobileNet-16 as the teacher model and MobileNet-9 as the student model, the Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and a $2.05\times$ computation rate.
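The mechanics named in the abstract are standard enough to sketch. The code below is a minimal PyTorch sketch, not the paper's implementation: it shows a temperature-scaled KL-divergence loss on softened logits, an offline ensemble that averages the logits of several pre-trained teachers, and a $1\times 1$ convolution adapter that projects student feature maps to the teacher's channel count so a cross-layer feature loss can be computed despite the dimension mismatch. The temperature and the loss weights alpha and beta are hypothetical defaults, and the feature-matching term here is a plain MSE stand-in for the paper's cross-layer matrix.

```python
# Minimal sketch of KL-based knowledge distillation with a 1x1-conv adapter
# and an offline teacher ensemble. Hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """Temperature-scaled KL divergence between softened class distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def ensemble_teacher_logits(teacher_logits_list):
    """Offline ensemble: average the logits of several frozen, pre-trained teachers."""
    return torch.stack(teacher_logits_list, dim=0).mean(dim=0)


class ChannelAdapter(nn.Module):
    """1x1 convolution projecting student feature maps to the teacher's channel
    count, relaxing the dimension mismatch for cross-layer feature losses."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels,
                              kernel_size=1, bias=False)

    def forward(self, student_feat):
        return self.proj(student_feat)


def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                      adapter, labels, alpha=0.5, beta=0.1, temperature=4.0):
    """Hard-label cross-entropy + logit-level KL + cross-layer feature matching."""
    ce = F.cross_entropy(student_logits, labels)
    kl = kd_kl_loss(student_logits, teacher_logits, temperature)
    feat = F.mse_loss(adapter(student_feat), teacher_feat)
    return (1.0 - alpha) * ce + alpha * kl + beta * feat
```

In training, the ensemble logits would be computed once per batch from the frozen teachers, and gradients would flow only through the student and the adapter; how the two distillation stages are scheduled is specific to the paper.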
Source journal
APSIPA Transactions on Signal and Information Processing (Engineering, Electrical & Electronic)
CiteScore: 8.60
Self-citation rate: 6.20%
Articles published: 30
Review time: 40 weeks