{"title":"可进行宽度和深度动态推断的灵活 BERT 模型","authors":"Ting Hu, Christoph Meinel, Haojin Yang","doi":"10.1016/j.csl.2024.101646","DOIUrl":null,"url":null,"abstract":"<div><p>Fine-tuning and inference on Large Language Models like BERT have become increasingly expensive regarding memory cost and computation resources. The recently proposed computation-flexible BERT models facilitate their deployment in varied computational environments. Training such flexible BERT models involves jointly optimizing multiple BERT subnets, which will unavoidably interfere with one another. Besides, the performance of large subnets is limited by the performance gap between the smallest subnet and the supernet, despite efforts to enhance the smaller subnets. In this regard, we propose layer-wise Neural grafting to boost BERT subnets, especially the larger ones. The proposed method improves the average performance of the subnets on six GLUE tasks and boosts the supernets on all GLUE tasks and the SQuAD data set. Based on the boosted subnets, we further build an inference framework enabling practical width- and depth-dynamic inference regarding different inputs by combining width-dynamic gating modules and early exit off-ramps in the depth dimension. Experimental results show that the proposed framework achieves a better dynamic inference range than other methods in terms of trading off performance and computational complexity on four GLUE tasks and SQuAD. In particular, our best-tradeoff inference result outperforms other fixed-size models with similar amount of computations. Compared to BERT-Base, the proposed inference framework yields a 1.3-point improvement in the average GLUE score and a 2.2-point increase in the F1 score on SQuAD, while reducing computations by around 45%.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"87 ","pages":"Article 101646"},"PeriodicalIF":3.1000,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000299/pdfft?md5=bb08debe9a20bb5be9d04abbaca1b345&pid=1-s2.0-S0885230824000299-main.pdf","citationCount":"0","resultStr":"{\"title\":\"A flexible BERT model enabling width- and depth-dynamic inference\",\"authors\":\"Ting Hu, Christoph Meinel, Haojin Yang\",\"doi\":\"10.1016/j.csl.2024.101646\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Fine-tuning and inference on Large Language Models like BERT have become increasingly expensive regarding memory cost and computation resources. The recently proposed computation-flexible BERT models facilitate their deployment in varied computational environments. Training such flexible BERT models involves jointly optimizing multiple BERT subnets, which will unavoidably interfere with one another. Besides, the performance of large subnets is limited by the performance gap between the smallest subnet and the supernet, despite efforts to enhance the smaller subnets. In this regard, we propose layer-wise Neural grafting to boost BERT subnets, especially the larger ones. The proposed method improves the average performance of the subnets on six GLUE tasks and boosts the supernets on all GLUE tasks and the SQuAD data set. Based on the boosted subnets, we further build an inference framework enabling practical width- and depth-dynamic inference regarding different inputs by combining width-dynamic gating modules and early exit off-ramps in the depth dimension. 
Experimental results show that the proposed framework achieves a better dynamic inference range than other methods in terms of trading off performance and computational complexity on four GLUE tasks and SQuAD. In particular, our best-tradeoff inference result outperforms other fixed-size models with similar amount of computations. Compared to BERT-Base, the proposed inference framework yields a 1.3-point improvement in the average GLUE score and a 2.2-point increase in the F1 score on SQuAD, while reducing computations by around 45%.</p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"87 \",\"pages\":\"Article 101646\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000299/pdfft?md5=bb08debe9a20bb5be9d04abbaca1b345&pid=1-s2.0-S0885230824000299-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000299\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000299","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A flexible BERT model enabling width- and depth-dynamic inference
Fine-tuning and inference with large language models such as BERT have become increasingly expensive in terms of memory and computation. Recently proposed computation-flexible BERT models ease their deployment in varied computational environments. Training such flexible BERT models involves jointly optimizing multiple BERT subnets, which unavoidably interfere with one another. Moreover, the performance of the larger subnets is limited by the gap between the smallest subnet and the supernet, despite efforts to enhance the smaller subnets. To address this, we propose layer-wise neural grafting to boost BERT subnets, especially the larger ones. The proposed method improves the average performance of the subnets on six GLUE tasks and boosts the supernet on all GLUE tasks and the SQuAD data set. Building on the boosted subnets, we further construct an inference framework that enables practical width- and depth-dynamic inference for different inputs by combining width-dynamic gating modules with early-exit off-ramps along the depth dimension. Experimental results show that the proposed framework achieves a wider dynamic inference range than other methods when trading off performance against computational complexity on four GLUE tasks and SQuAD. In particular, our best-tradeoff inference result outperforms other fixed-size models with a similar amount of computation. Compared to BERT-Base, the proposed inference framework yields a 1.3-point improvement in the average GLUE score and a 2.2-point increase in F1 on SQuAD, while reducing computation by around 45%.
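To make the width- and depth-dynamic inference idea concrete, the sketch below shows one way such a framework could be wired up in PyTorch. It is a minimal illustration, not the authors' implementation: the module names (WidthGate, ExitRamp, FlexibleEncoderLayer), the width-ratio set, and the entropy threshold are assumptions introduced for the example. A per-layer gating module chooses how much of the feed-forward width to use for the current input, and an early-exit off-ramp after each layer stops computation once its internal prediction is confident enough.

import torch
import torch.nn as nn


class WidthGate(nn.Module):
    # Hypothetical gating module: predicts a width ratio from the [CLS] state.
    def __init__(self, hidden_size, ratios=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        self.ratios = ratios
        self.scorer = nn.Linear(hidden_size, len(ratios))

    def forward(self, cls_state):
        idx = self.scorer(cls_state).argmax(dim=-1)  # greedy choice at inference
        return self.ratios[int(idx[0])]


class ExitRamp(nn.Module):
    # Internal classifier; exit early when its prediction entropy is low.
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_state):
        logits = self.classifier(cls_state)
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        return logits, entropy


class FlexibleEncoderLayer(nn.Module):
    # Transformer layer whose feed-forward width can be truncated at inference time.
    def __init__(self, hidden_size, ffn_size, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn_in = nn.Linear(hidden_size, ffn_size)
        self.ffn_out = nn.Linear(ffn_size, hidden_size)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x, width_ratio):
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        k = max(1, int(self.ffn_in.out_features * width_ratio))
        # Keep only the first k feed-forward units: slicing the weights emulates
        # running a narrower subnet of the same supernet.
        h = torch.relu(x @ self.ffn_in.weight[:k].T + self.ffn_in.bias[:k])
        f = h @ self.ffn_out.weight[:, :k].T + self.ffn_out.bias
        return self.norm2(x + f)


def dynamic_inference(layers, gates, ramps, x, entropy_threshold=0.2):
    # Run layer by layer; choose a width per layer and exit once confident.
    for layer, gate, ramp in zip(layers, gates, ramps):
        ratio = gate(x[:, 0])                  # width decision from the [CLS] token
        x = layer(x, ratio)                    # one depth step at the chosen width
        logits, entropy = ramp(x[:, 0])
        if entropy.max() < entropy_threshold:  # confident enough: take the off-ramp
            return logits
    return logits                              # otherwise use the final layer's output

Under this kind of scheme, easy inputs leave through an early off-ramp at a reduced width, while harder inputs traverse more layers at larger widths; the entropy threshold (here an assumed value of 0.2) is the knob that trades accuracy for computation.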
Journal overview:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.