SDSK2BERT: Explore the Specific Depth with Specific Knowledge to Compress BERT

Lifang Ding, Yujiu Yang
{"title":"SDSK2BERT: Explore the Specific Depth with Specific Knowledge to Compress BERT","authors":"Lifang Ding, Yujiu Yang","doi":"10.1109/ICBK50248.2020.00066","DOIUrl":null,"url":null,"abstract":"The success of a pretraining model like BERT in Natural Language Processing (NLP) puts forward the demand for model compression. Previous works adopting knowledge distillation (KD) to compress BERT are conducted with fixed depth, thus the problem of over-parameterization is not fully explored without answering the appropriate depth for a specific data set. In this work, we take two data sets of Natural Language Inference (NLI) with different difficulty levels as examples to answer the question of layer numbers. During the exploration of depth, we use the learned dataset-specific weights to warm up the networks in the next run, making the model find a better local optimum. With 1%~2% drops on the accuracy, our method reduces the 12-layer BERT model to 6-layer on the MNLI-matched dataset and 2-layer on the DNLI dataset, which not only reduces the parameters to 1/2x and 1/6x respectively but also outperforms the general knowledge distillation framework by about 1% accuracy. What’s more, we explain why and when our framework works with the help of visualization.","PeriodicalId":432857,"journal":{"name":"2020 IEEE International Conference on Knowledge Graph (ICKG)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Conference on Knowledge Graph (ICKG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICBK50248.2020.00066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

The success of pretrained models such as BERT in Natural Language Processing (NLP) has created a demand for model compression. Previous works that adopt knowledge distillation (KD) to compress BERT use a fixed depth, so the problem of over-parameterization is not fully explored: they do not answer what depth is appropriate for a specific data set. In this work, we take two Natural Language Inference (NLI) data sets of different difficulty levels as examples to answer this question of layer number. During the exploration of depth, we use the learned dataset-specific weights to warm up the network in the next run, helping the model find a better local optimum. With a drop of 1% to 2% in accuracy, our method reduces the 12-layer BERT model to 6 layers on the MNLI-matched dataset and to 2 layers on the DNLI dataset, which not only reduces the parameters to 1/2 and 1/6 of the original, respectively, but also outperforms the general knowledge distillation framework by about 1% accuracy. Moreover, we explain why and when our framework works with the help of visualization.
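
The depth-exploration procedure with warm-started distillation described above can be illustrated with a short sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: build_student, copy_matching_layers, evaluate, and train_loader are hypothetical helpers, and the hyperparameters (temperature, loss weight, learning rate, the accuracy budget) are assumptions chosen only to mirror the numbers quoted in the abstract.

```python
# Minimal sketch (not the paper's code) of depth search with warm-started
# knowledge distillation: each shallower student is initialized from the
# dataset-specific weights learned in the previous run before KD training.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: temperature-scaled KL term on the teacher's
    soft labels plus hard cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def search_depth(teacher, depths=(12, 6, 4, 2), tol=0.02, epochs=3):
    """Try progressively shallower students; return the smallest model whose
    accuracy stays within `tol` of the full-depth teacher."""
    teacher.eval()
    base_acc = evaluate(teacher)             # ASSUMPTION: evaluation helper
    prev, best = teacher, teacher
    for depth in depths[1:]:
        student = build_student(depth)       # ASSUMPTION: student constructor
        copy_matching_layers(prev, student)  # warm up from the previous run
        opt = torch.optim.AdamW(student.parameters(), lr=3e-5)
        for _ in range(epochs):
            for batch, labels in train_loader:   # ASSUMPTION: data loader
                with torch.no_grad():
                    t_logits = teacher(batch)    # frozen teacher predictions
                s_logits = student(batch)
                loss = kd_loss(s_logits, t_logits, labels)
                opt.zero_grad()
                loss.backward()
                opt.step()
        acc = evaluate(student)
        if base_acc - acc > tol:    # accuracy drop exceeds the budget,
            break                   # so the previous depth was the sweet spot
        best, prev = student, student
    return best
```

The design choice reflected here is that each shallower student starts from the weights learned in the previous, deeper run rather than from the generic pretrained checkpoint, which is what the abstract refers to as warming up the network with dataset-specific weights.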