Efficient knowledge distillation of teacher model to multiple student models

2021 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT). Pub Date: 2021-07-27. DOI: 10.1109/IAICT52856.2021.9532543

Thrivikram Gl, Vidya Ganesh, T. Sethuraman, Satheesh K. Perepu



Abstract

Deep learning models have been shown to deliver satisfactory results when learning complex non-linear relationships between a set of input features and different task outputs. However, they are memory intensive and require substantial computational power for both training and inference. The literature offers various model compression techniques that enable easier deployment on edge devices. Knowledge distillation is one such approach, in which the knowledge of a complex teacher model is transferred to a student model with fewer parameters. Its limitation is that the architecture of the student model must be comparable to that of the complex teacher model for effective knowledge transfer. Because of this limitation, a student model that learns from a large, complex teacher cannot be deployed on edge devices. In this work, we propose a combined student approach in which different student models learn from a common teacher model. Further, we propose a unique loss function that trains multiple student models simultaneously. An advantage of this approach is that these student models can be much simpler than both the traditional single student model and the complex teacher model. Finally, we provide an extensive evaluation showing that our approach can improve overall accuracy significantly and allows a further 10% compression compared with a generic model.
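The abstract does not give the form of the proposed loss function, so the following is only an illustrative sketch of the general idea of training several students against one teacher. It assumes a standard temperature-scaled distillation loss (soft-target KL divergence plus hard-label cross-entropy) summed over students. The function names, the `temperature` and `alpha` parameters, and the plain summation are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature, computed stably."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Loss for one student (assumed form, not taken from the paper):
    alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    eps = 1e-12  # guards against log(0)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: how far the student's softened output is from the teacher's.
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=1
    ).mean()
    # Hard-label term: ordinary cross-entropy at temperature 1.
    probs = softmax(student_logits)
    ce = -np.log(probs[np.arange(len(labels)), labels] + eps).mean()
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

def multi_student_loss(student_logits_list, teacher_logits, labels, **kwargs):
    """Hypothetical combined objective: the sum of per-student losses,
    so all students are optimised against the same teacher in one step."""
    return sum(
        distillation_loss(s, teacher_logits, labels, **kwargs)
        for s in student_logits_list
    )
```

Under these assumptions, `multi_student_loss([logits_a, logits_b], teacher_logits, labels)` returns one scalar that could be minimised jointly over all students. The loss proposed in the paper may weight or couple the students differently.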
