SEMQ: Efficient non-uniform quantization with sensitivity-based error minimization for large language models
Dongmin Li, Xiurui Xie, Dongyang Zhang, Athanasios V. Vasilakos, Man-Fai Leung
Future Generation Computer Systems, Volume 175, Article 108120 (published 2025-09-05). DOI: 10.1016/j.future.2025.108120
Citations: 0
Abstract
Large Language Models (LLMs) represent a pivotal breakthrough in computational intelligence, showcasing exceptional capabilities in information aggregation and reasoning. However, their remarkable performance comes at the cost of an ultra-large parameter scale, leading to significant resource demands during deployment. Therefore, various model compression techniques have been developed, such as pruning, distillation, and quantization. Among these, quantization has gained prominence due to its ability to directly reduce the precision of model weights and activations, resulting in substantial memory savings and accelerated inference. Despite its advantages, existing quantization approaches face substantial challenges at ultra-low precision (e.g., 2-bit), often resulting in severe performance degradation. To tackle this challenge, we propose a novel non-uniform quantization method with minimal disturbance for LLMs, which comprises two innovations: (i) a Sensitivity-based Error Minimization Non-Uniform Quantization (SEMQ) algorithm, which iteratively searches for the quantization scheme that minimizes the quantization error; and (ii) a Z-score-based method for outlier detection and isolation under the normal distribution assumption, reducing the complexity of the quantization process. Extensive experiments on the LLaMA family demonstrate that the proposed SEMQ enables ultra-low-precision quantization down to 2-bit and reduces GPU memory by up to 10× relative to the original LLMs while maintaining model accuracy. Our code is publicly available at https://github.com/ldm2060/semq.
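To make the two components concrete, the NumPy sketch below illustrates the general idea only: weights with large Z-scores are isolated as outliers, and the remaining values are quantized with a small non-uniform codebook that is refined iteratively to reduce a sensitivity-weighted error. The function names, the Z-score threshold, and the per-weight sensitivity scores are illustrative assumptions, not the paper's implementation; the authors' actual SEMQ code is in the linked repository.

```python
# Illustrative sketch only: Z-score outlier isolation plus an iterative
# (Lloyd-Max / weighted k-means style) non-uniform codebook fit.
# Names, thresholds, and the "sensitivity" weights are assumptions.
import numpy as np

def split_outliers(w, z_thresh=3.0):
    """Mark weights whose Z-score exceeds z_thresh (normality assumption)."""
    mu, sigma = w.mean(), w.std() + 1e-12
    z = np.abs((w - mu) / sigma)
    return z > z_thresh            # True for outliers kept at higher precision

def fit_codebook(w, sens, n_levels=4, iters=50):
    """Iteratively refine n_levels centroids to reduce sensitivity-weighted error."""
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = np.quantile(w, np.linspace(0.0, 1.0, n_levels))
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_levels):
            sel = idx == k
            if sel.any():
                # Weighted mean: more "sensitive" weights pull the centroid harder.
                centroids[k] = np.average(w[sel], weights=sens[sel])
    # Final assignment with the refined centroids.
    idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=4096)          # toy weight tensor
    sens = rng.uniform(0.5, 1.5, size=w.shape)    # placeholder sensitivity scores
    outlier_mask = split_outliers(w)
    inliers = ~outlier_mask
    codebook, assign = fit_codebook(w[inliers], sens[inliers], n_levels=4)  # 2-bit => 4 levels
    w_q = w.copy()
    w_q[inliers] = codebook[assign]               # outliers kept at full precision
    mse = np.mean((w - w_q) ** 2)
    print(f"levels={codebook}, mse={mse:.3e}, outliers={outlier_mask.sum()}")
```

In this toy setting, a 2-bit codebook corresponds to four non-uniform levels; keeping the few Z-score outliers at full precision is what lets the remaining weights be fitted well under a normality assumption.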
Journal overview:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.