SDSK2BERT: Explore the Specific Depth with Specific Knowledge to Compress BERT

Lifang Ding, Yujiu Yang
{"title":"SDSK2BERT: Explore the Specific Depth with Specific Knowledge to Compress BERT","authors":"Lifang Ding, Yujiu Yang","doi":"10.1109/ICBK50248.2020.00066","DOIUrl":null,"url":null,"abstract":"The success of a pretraining model like BERT in Natural Language Processing (NLP) puts forward the demand for model compression. Previous works adopting knowledge distillation (KD) to compress BERT are conducted with fixed depth, thus the problem of over-parameterization is not fully explored without answering the appropriate depth for a specific data set. In this work, we take two data sets of Natural Language Inference (NLI) with different difficulty levels as examples to answer the question of layer numbers. During the exploration of depth, we use the learned dataset-specific weights to warm up the networks in the next run, making the model find a better local optimum. With 1%~2% drops on the accuracy, our method reduces the 12-layer BERT model to 6-layer on the MNLI-matched dataset and 2-layer on the DNLI dataset, which not only reduces the parameters to 1/2x and 1/6x respectively but also outperforms the general knowledge distillation framework by about 1% accuracy. What’s more, we explain why and when our framework works with the help of visualization.","PeriodicalId":432857,"journal":{"name":"2020 IEEE International Conference on Knowledge Graph (ICKG)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Conference on Knowledge Graph (ICKG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICBK50248.2020.00066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

The success of pretrained models such as BERT in Natural Language Processing (NLP) has created a demand for model compression. Previous works that adopt knowledge distillation (KD) to compress BERT use a fixed depth, so the problem of over-parameterization is not fully explored: they do not answer what depth is appropriate for a specific data set. In this work, we take two Natural Language Inference (NLI) data sets of different difficulty levels as examples to answer this question of layer number. During the exploration of depth, we use the learned dataset-specific weights to warm up the network in the next run, helping the model find a better local optimum. With a drop of 1% to 2% in accuracy, our method reduces the 12-layer BERT model to 6 layers on the MNLI-matched dataset and to 2 layers on the DNLI dataset, which not only reduces the parameters to 1/2 and 1/6 of the original, respectively, but also outperforms the general knowledge distillation framework by about 1% accuracy. Moreover, we explain why and when our framework works with the help of visualization.
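
The depth-exploration procedure with warm-started distillation described above can be illustrated with a short sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: build_student, copy_matching_layers, evaluate, and train_loader are hypothetical helpers, and the hyperparameters (temperature, loss weight, learning rate, the accuracy budget) are assumptions chosen only to mirror the numbers quoted in the abstract.

```python
# Minimal sketch (not the paper's code) of depth search with warm-started
# knowledge distillation: each shallower student is initialized from the
# dataset-specific weights learned in the previous run before KD training.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: temperature-scaled KL term on the teacher's
    soft labels plus hard cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def search_depth(teacher, depths=(12, 6, 4, 2), tol=0.02, epochs=3):
    """Try progressively shallower students; return the smallest model whose
    accuracy stays within `tol` of the full-depth teacher."""
    teacher.eval()
    base_acc = evaluate(teacher)             # ASSUMPTION: evaluation helper
    prev, best = teacher, teacher
    for depth in depths[1:]:
        student = build_student(depth)       # ASSUMPTION: student constructor
        copy_matching_layers(prev, student)  # warm up from the previous run
        opt = torch.optim.AdamW(student.parameters(), lr=3e-5)
        for _ in range(epochs):
            for batch, labels in train_loader:   # ASSUMPTION: data loader
                with torch.no_grad():
                    t_logits = teacher(batch)    # frozen teacher predictions
                s_logits = student(batch)
                loss = kd_loss(s_logits, t_logits, labels)
                opt.zero_grad()
                loss.backward()
                opt.step()
        acc = evaluate(student)
        if base_acc - acc > tol:    # accuracy drop exceeds the budget,
            break                   # so the previous depth was the sweet spot
        best, prev = student, student
    return best
```

The design choice reflected here is that each shallower student starts from the weights learned in the previous, deeper run rather than from the generic pretrained checkpoint, which is what the abstract refers to as warming up the network with dataset-specific weights.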