Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models

Annual Meeting of the Association for Computational Linguistics
Publication date: 2023-07-12
DOI: 10.48550/arXiv.2307.05972

James O'Neill, Sourav Dutta


Abstract

We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines. We apply SDQ to multilingual models XLM-R_{\text{Base}} and InfoXLM_{\text{Base}} and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.
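To make the 32-bit to 8-bit reduction concrete, the following is a minimal sketch of symmetric per-tensor int8 weight quantization, assuming round-to-nearest and a single shared scale. It is a generic illustration, not the SDQ method itself, whose details the abstract does not give. The function names `quantize_int8` and `dequantize` and the example weights are hypothetical.

```python
# Illustrative only: generic symmetric int8 quantization of a weight list.
# This is NOT the paper's SDQ procedure; names and values are hypothetical.

def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical fp32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Per-weight rounding error; with round-to-nearest it is at most scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight picks up a small rounding error of this kind, and such errors can compound as activations pass through successive layers, which is presumably the accumulation the abstract refers to.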


