LRQuant: A Unified and Learnable Framework to Post-training Quantization for Transformer-based Large Foundation Models.

IF 18.6 · JCR Q1, Computer Science, Artificial Intelligence · CAS Region 1 (Computer Science)
Jiaqi Zhao, Chao Zeng, Ming Wang, Linxuan Han, Yuzhang Shang, Miao Zhang, Liqiang Nie
DOI: 10.1109/tpami.2025.3599479 (https://doi.org/10.1109/tpami.2025.3599479)
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication date: 2025-08-14 (Journal Article)
Citations: 0

Abstract

Post-training quantization (PTQ) for transformer-based large foundation models (LFMs) significantly accelerates model inference and relieves memory constraints without requiring model training. However, existing methods face three main issues: 1) the scaling factors commonly used in scale-reparameterization-based weight-activation quantization to mitigate quantization error are mostly hand-crafted, which can lead to suboptimal results; 2) the current formulation of quantization error, defined by the L2 norm, ignores directional shifts introduced by quantization; 3) most methods are tailored to a single scenario, i.e., evaluated only on LLMs or designed only for weight-only quantization, and thus lack comprehensive evaluation on diverse benchmarks and a broad application scope. To address these challenges, this paper introduces LRQuant, a unified Learnable and Robust post-training Quantization framework for transformer-based LFMs and various quantization scenarios. First, we adopt an efficient block-wise learnable paradigm that finds optimal scaling factors, initialized by logarithmic activation equivalence, and obtains a suitable clipping range for the quantization steps. In addition, we empirically find that relying on the MSE loss alone rarely yields optimal quantization results, so we reformulate the quantization error and propose a novel loss function based on the negative logarithm of the cosine similarity (NLC loss) between the outputs of the full-precision and quantized blocks. To fully exploit the potential of our learnable paradigm, we propose an improved version, LRQuant+. Specifically, we first propose a dynamically weighted scheme to balance the MSE and NLC losses, and then devise learnable rotation vectors to further reduce directional gaps directly.
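As a minimal sketch (not the paper's implementation), the NLC loss described in the abstract can be written as the negative logarithm of the per-sample cosine similarity between the full-precision and quantized block outputs; the `eps` clamp below is an assumption added for numerical safety:

```python
import numpy as np

def nlc_loss(fp_out: np.ndarray, q_out: np.ndarray, eps: float = 1e-8) -> float:
    """Negative log of cosine similarity between full-precision and
    quantized block outputs, averaged over the batch dimension."""
    a = fp_out.reshape(fp_out.shape[0], -1)
    b = q_out.reshape(q_out.shape[0], -1)
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    )
    # Clamp keeps the log finite when outputs point in opposite directions.
    return float(np.mean(-np.log(np.clip(cos, eps, None))))
```

Unlike a plain MSE term, this penalty depends only on the angle between the two outputs, which is why it captures the directional shift that the L2-norm formulation ignores.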
In addition, we improve the block-wise optimization framework into a novel two-branch design that jointly considers error propagation and the homologous reconstruction error. Extensive experiments demonstrate the superiority of LRQuant and LRQuant+, as well as their unified effectiveness across various LFMs for both weight-activation and weight-only quantization, especially under challenging quantization settings such as W4A4 and W2A16 on LLMs, ViTs, and MLLMs. Code is available at https://github.com/zjq0455/LRQuant.
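For context, settings such as W4A4 and W2A16 refer to uniform quantization of weights and activations to the stated bit widths. A minimal sketch of symmetric fake quantization with a scaling factor (the quantity a learnable PTQ method would optimize per block; here it is simply passed in, and the per-tensor granularity is an illustrative assumption):

```python
import numpy as np

def fake_quant(x: np.ndarray, scale: float, bits: int = 4) -> np.ndarray:
    """Simulated symmetric uniform quantization: divide by the scaling
    factor, round to the integer grid, clip to the signed range, rescale."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```

In a learnable framework like the one described above, the scale (and the clipping range) would be trained block by block against a reconstruction loss rather than hand-crafted.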
Source journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
CiteScore: 28.40 · Self-citation rate: 3.00% · Articles per year: 885 · Review time: 8.5 months
Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.