{"title":"基于知识蒸馏和多任务学习的集成压缩语言模型","authors":"Kun Xiang, Akihiro Fujii","doi":"10.1109/ICBIR54589.2022.9786508","DOIUrl":null,"url":null,"abstract":"The success of pre-trained language representation models such as BERT benefits from their “overparameterized” nature, resulting in training time consuming, high computational complexity and superior requirement of devices. Among the variety of model compression and acceleration techniques, Knowledge Distillation(KD) has attracted extensive attention for compressing pre-trained language models. However, the major two challenges for KD are: (i)Transfer more knowledge from the teacher model to student model without scarifying accuracy while accelerating. (ii)Higher training speed of the lightweight model is accompanied by the risk of overfitting due to the noise influence. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN(TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and CNN as the student model. Aiming at the inherent noise problem, we propose coordinated CNN-BiLSTM as a parameter-sharing layer for Multi-Task Learning (MTL), in order to capture both regional and long-term dependence information. Our approach has approximately good performance as BERT-base and teacher model with $12 \\times$ and $281 \\times$ speedup of inference and $19.58 \\times$ and $8.94 \\times$ fewer parameters usage, respectively.","PeriodicalId":216904,"journal":{"name":"2022 7th International Conference on Business and Industrial Research (ICBIR)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Ensemble Compressed Language Model Based on Knowledge Distillation and Multi-Task Learning\",\"authors\":\"Kun Xiang, Akihiro Fujii\",\"doi\":\"10.1109/ICBIR54589.2022.9786508\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The success of pre-trained language representation models such as BERT benefits from their “overparameterized” nature, resulting in training time consuming, high computational complexity and superior requirement of devices. Among the variety of model compression and acceleration techniques, Knowledge Distillation(KD) has attracted extensive attention for compressing pre-trained language models. However, the major two challenges for KD are: (i)Transfer more knowledge from the teacher model to student model without scarifying accuracy while accelerating. (ii)Higher training speed of the lightweight model is accompanied by the risk of overfitting due to the noise influence. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN(TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and CNN as the student model. Aiming at the inherent noise problem, we propose coordinated CNN-BiLSTM as a parameter-sharing layer for Multi-Task Learning (MTL), in order to capture both regional and long-term dependence information. 
Our approach has approximately good performance as BERT-base and teacher model with $12 \\\\times$ and $281 \\\\times$ speedup of inference and $19.58 \\\\times$ and $8.94 \\\\times$ fewer parameters usage, respectively.\",\"PeriodicalId\":216904,\"journal\":{\"name\":\"2022 7th International Conference on Business and Industrial Research (ICBIR)\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 7th International Conference on Business and Industrial Research (ICBIR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICBIR54589.2022.9786508\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 7th International Conference on Business and Industrial Research (ICBIR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICBIR54589.2022.9786508","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Ensemble Compressed Language Model Based on Knowledge Distillation and Multi-Task Learning
The success of pre-trained language representation models such as BERT stems from their "overparameterized" nature, which leads to time-consuming training, high computational complexity, and demanding hardware requirements. Among the variety of model compression and acceleration techniques, Knowledge Distillation (KD) has attracted extensive attention for compressing pre-trained language models. However, KD faces two major challenges: (i) transferring more knowledge from the teacher model to the student model without sacrificing accuracy while accelerating inference; (ii) the higher training speed of the lightweight model is accompanied by a risk of overfitting due to noise. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN (TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and a CNN as the student model. To counter the inherent noise problem, we propose a coordinated CNN-BiLSTM as the parameter-sharing layer for Multi-Task Learning (MTL), so as to capture both regional and long-term dependency information. Our approach achieves performance approximately on par with BERT-base and the teacher model, with $12 \times$ and $281 \times$ inference speedups and $19.58 \times$ and $8.94 \times$ fewer parameters, respectively.
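As a rough illustration of the kind of setup the abstract describes (not the authors' released code), the PyTorch sketch below shows a small CNN-BiLSTM student whose encoder is shared across tasks and which is trained with a blend of a hard-label task loss and a soft-label distillation loss against a frozen teacher's logits. All module names, dimensions, and hyperparameters (kernel size, temperature, loss weighting) are illustrative assumptions.

```python
# Illustrative sketch only -- module names, dimensions, and hyperparameters
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCNNBiLSTM(nn.Module):
    """Parameter-sharing encoder: a 1D CNN captures local (regional) patterns,
    a BiLSTM on top captures longer-range dependencies."""
    def __init__(self, vocab_size, emb_dim=128, conv_channels=128, lstm_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_channels, lstm_hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids)                    # (batch, seq_len, emb_dim)
        x = F.relu(self.conv(x.transpose(1, 2)))     # (batch, channels, seq_len)
        x, _ = self.bilstm(x.transpose(1, 2))        # (batch, seq_len, 2*hidden)
        return x.max(dim=1).values                   # max-pooled sentence vector

class MultiTaskStudent(nn.Module):
    """Shared encoder plus one lightweight classification head per task (MTL)."""
    def __init__(self, vocab_size, num_labels_per_task):
        super().__init__()
        self.encoder = SharedCNNBiLSTM(vocab_size)
        self.heads = nn.ModuleList([nn.Linear(256, n) for n in num_labels_per_task])

    def forward(self, token_ids, task_id):
        return self.heads[task_id](self.encoder(token_ids))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label cross-entropy blended with temperature-scaled KL divergence
    against the (frozen) teacher's soft labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

In the paper's setting the teacher logits would come from BERT-of-Theseus kept frozen during student training; in this sketch they are simply assumed to be available as a tensor per batch.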