BERT Model Compression With Decoupled Knowledge Distillation And Representation Learning

Linna Zhang, Yuehui Chen, Yi Cao, Ya-ou Zhao
{"title":"BERT Model Compression With Decoupled Knowledge Distillation And Representation Learning","authors":"Linna Zhang, Yuehui Chen, Yi Cao, Ya-ou Zhao","doi":"10.1145/3573834.3574482","DOIUrl":null,"url":null,"abstract":"Pre-trained language models such as BERT have proven essential in natural language processing(NLP). However, their huge number of parameters and training cost make them very limited in practical deployment. To overcome BERT’s lack of computing resources, we propose a BERT compression method by applying decoupled knowledge distillation and representation learning, compressing the large model(teacher) into a lightweight network(student). Decoupled knowledge distillation divides the classical distillation loss into target related knowledge distillation(TRKD) and non-target related knowledge distillation(NRKD). Representation learning pools the Transformer output of each two layers, and the student network learns the intermediate features of the teacher network. It has better results on tasks of Sentiment Classification and Paraphrase Similarity Matching, retaining 98.9% performance of the large model.","PeriodicalId":345434,"journal":{"name":"Proceedings of the 4th International Conference on Advanced Information Science and System","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th International Conference on Advanced Information Science and System","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573834.3574482","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Pre-trained language models such as BERT have proven essential in natural language processing (NLP). However, their huge number of parameters and high training cost severely limit practical deployment. To reduce BERT's demand for computing resources, we propose a BERT compression method that applies decoupled knowledge distillation and representation learning, compressing the large model (teacher) into a lightweight network (student). Decoupled knowledge distillation divides the classical distillation loss into target-related knowledge distillation (TRKD) and non-target-related knowledge distillation (NRKD). Representation learning pools the Transformer outputs of every two teacher layers, and the student network learns the intermediate features of the teacher network. The compressed model performs well on sentiment classification and paraphrase similarity matching tasks, retaining 98.9% of the large model's performance.
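To make the two loss terms concrete, below is a minimal PyTorch sketch of how such a decoupled distillation loss and intermediate-feature matching are commonly implemented. The TRKD/NRKD split follows the standard decoupled-KD decomposition (a binary KL over the target vs. non-target probability mass, plus a KL over the renormalized non-target classes), and the representation term averages every two teacher Transformer layers as the pooling step. The function names, the mean-pooling choice, the layer mapping, and the temperature and loss weights are illustrative assumptions; the paper's exact implementation may differ.

```python
# Minimal sketch, assuming PyTorch, a 12-layer teacher, and a 6-layer student
# whose hidden size matches the teacher's. Names and hyperparameters are
# illustrative, not the paper's exact implementation.
import torch
import torch.nn.functional as F


def decoupled_kd_loss(student_logits, teacher_logits, labels,
                      T=2.0, alpha=1.0, beta=1.0):
    """Split the classical soft-label KD loss into a target-related (TRKD)
    term and a non-target-related (NRKD) term."""
    p_s = F.softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    target_mask = F.one_hot(labels, num_classes=student_logits.size(-1)).bool()

    # TRKD: binary KL over the (target, all non-target) probability mass.
    pt_s = p_s.gather(1, labels.unsqueeze(1)).squeeze(1)
    pt_t = p_t.gather(1, labels.unsqueeze(1)).squeeze(1)
    bin_s = torch.stack([pt_s, 1.0 - pt_s], dim=-1).clamp_min(1e-8)
    bin_t = torch.stack([pt_t, 1.0 - pt_t], dim=-1)
    trkd = F.kl_div(bin_s.log(), bin_t, reduction="batchmean") * T * T

    # NRKD: KL over the non-target classes, renormalized to sum to 1.
    # A large negative fill (not -inf) keeps kl_div finite at the target index.
    masked_s = student_logits.masked_fill(target_mask, -1e4)
    masked_t = teacher_logits.masked_fill(target_mask, -1e4)
    nrkd = F.kl_div(F.log_softmax(masked_s / T, dim=-1),
                    F.softmax(masked_t / T, dim=-1),
                    reduction="batchmean") * T * T

    return alpha * trkd + beta * nrkd


def representation_loss(student_hidden, teacher_hidden):
    """Representation-learning term: pool the Transformer outputs of every two
    teacher layers (mean pooling here) and let each student layer match the
    pooled intermediate features."""
    loss = 0.0
    for i, h_s in enumerate(student_hidden):
        # Teacher layers (2i, 2i+1) are pooled into one target for student layer i.
        h_t = 0.5 * (teacher_hidden[2 * i] + teacher_hidden[2 * i + 1])
        loss = loss + F.mse_loss(h_s, h_t)
    return loss / len(student_hidden)
```

During student training, these two terms would typically be added to the task's cross-entropy loss with per-task weights, e.g. `loss = ce_loss + decoupled_kd_loss(...) + representation_loss(...)`.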