自蒸馏量化:在基于变压器的语言模型中实现高压缩率

Annual Meeting of the Association for Computational Linguistics Pub Date : 2023-07-12 DOI:10.48550/arXiv.2307.05972

James O'Neill, Sourav Dutta

{"title":"自蒸馏量化:在基于变压器的语言模型中实现高压缩率","authors":"James O'Neill, Sourav Dutta","doi":"10.48550/arXiv.2307.05972","DOIUrl":null,"url":null,"abstract":"We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines. We apply SDQ to multilingual models XLM-R_{\\text{Base}} and InfoXLM_{\\text{Base}} and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.","PeriodicalId":352845,"journal":{"name":"Annual Meeting of the Association for Computational Linguistics","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models\",\"authors\":\"James O'Neill, Sourav Dutta\",\"doi\":\"10.48550/arXiv.2307.05972\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines. We apply SDQ to multilingual models XLM-R_{\\\\text{Base}} and InfoXLM_{\\\\text{Base}} and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.\",\"PeriodicalId\":352845,\"journal\":{\"name\":\"Annual Meeting of the Association for Computational Linguistics\",\"volume\":\"33 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Annual Meeting of the Association for Computational Linguistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2307.05972\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annual Meeting of the Association for Computational Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2307.05972","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

我们研究了训练后量化和量化感知训练对Transformer语言模型泛化的影响。我们提出了一种新的方法，称为自蒸馏量化(SDQ)最小化累积量化误差和优于基线。我们将SDQ应用于多语言模型XLM-R_{\text{Base}}和InfoXLM_{\text{Base}}，并演示了这两个模型都可以从32位浮点权值减少到8位整数权值，同时在XGLUE基准测试中保持高水平的性能。我们的结果还强调了量化多语言模型的挑战，这些模型必须推广到它们没有经过微调的语言。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models

We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines. We apply SDQ to multilingual models XLM-R_{\text{Base}} and InfoXLM_{\text{Base}} and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Annual Meeting of the Association for Computational Linguistics

自引率

0.00%

发文量