Ensemble Compressed Language Model Based on Knowledge Distillation and Multi-Task Learning

Kun Xiang, Akihiro Fujii
DOI: 10.1109/ICBIR54589.2022.9786508 (https://doi.org/10.1109/ICBIR54589.2022.9786508)
Published in: 2022 7th International Conference on Business and Industrial Research (ICBIR)
Publication date: 2022-05-19
Citations: 0

Abstract

The success of pre-trained language representation models such as BERT stems from their "overparameterized" nature, which also makes training time-consuming, computation expensive, and hardware requirements demanding. Among the many model compression and acceleration techniques, Knowledge Distillation (KD) has attracted extensive attention for compressing pre-trained language models. However, KD faces two major challenges: (i) transferring more knowledge from the teacher model to the student model without sacrificing accuracy while accelerating inference; and (ii) the risk of overfitting caused by noise, which accompanies the faster training of a lightweight model. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN (TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and a CNN as the student model. To tackle the inherent noise problem, we propose a coordinated CNN-BiLSTM as a parameter-sharing layer for Multi-Task Learning (MTL), in order to capture both regional and long-term dependency information. Our approach achieves performance comparable to BERT-base and to the teacher model, with $12 \times$ and $281 \times$ faster inference and $19.58 \times$ and $8.94 \times$ fewer parameters, respectively.
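The abstract does not state the exact training objective, so the following is only a minimal sketch of the generic soft-target distillation loss commonly used in teacher-student setups, not the authors' implementation. The function names, the temperature of 2.0, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical KD objective: alpha * soft term + (1 - alpha) * hard term.

    The soft term is KL(teacher || student) on temperature-softened
    distributions, scaled by temperature**2. The hard term is the usual
    cross-entropy of the student against the gold label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft = (temperature ** 2) * kl
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard


if __name__ == "__main__":
    # Toy two-class example: the student disagrees with the teacher.
    print(distillation_loss([2.0, 0.0], [0.0, 2.0], label=1))
```

In this formulation a student that reproduces the teacher's logits incurs no soft-target penalty, so the remaining loss comes only from the hard-label term; how TBG-disCNN actually balances the two is not specified in the abstract.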