SEMQ: Efficient non-uniform quantization with sensitivity-based error minimization for large language models
Dongmin Li, Xiurui Xie, Dongyang Zhang, Athanasios V. Vasilakos, Man-Fai Leung
Future Generation Computer Systems, Volume 175, Article 108120 (published 2025-09-05). DOI: 10.1016/j.future.2025.108120
Citations: 0
Abstract
Large Language Models (LLMs) represent a pivotal breakthrough in computational intelligence, showcasing exceptional capabilities in information aggregation and reasoning. However, their remarkable performance comes at the cost of an ultra-large parameter scale, leading to significant resource demands during deployment. Therefore, various model compression techniques have been developed, such as pruning, distillation, and quantization. Among these, quantization has gained prominence due to its ability to directly reduce the precision of model weights and activations, resulting in substantial memory savings and accelerated inference. Despite its advantages, existing quantization approaches face substantial challenges at ultra-low precision (e.g., 2-bit), often resulting in severe performance degradation. To tackle this challenge, we propose a novel non-uniform quantization method with minimal disturbance for LLMs, which comprises two innovations: (i) a Sensitivity-based Error Minimization Non-Uniform Quantization (SEMQ) algorithm, which iteratively searches for the quantization scheme that minimizes the quantization error; and (ii) a Z-score-based method for outlier detection and isolation under the normal distribution assumption, reducing the complexity of the quantization process. Extensive experiments on the LLaMA family demonstrate that the proposed SEMQ enables ultra-low-precision quantization down to 2-bit and reduces GPU memory by up to 10× relative to the original LLMs while maintaining model accuracy. Our code is publicly available at https://github.com/ldm2060/semq.
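To make the two components concrete, the NumPy sketch below illustrates the general idea only: weights with large Z-scores are isolated as outliers, and the remaining values are quantized with a small non-uniform codebook that is refined iteratively to reduce a sensitivity-weighted error. The function names, the Z-score threshold, and the per-weight sensitivity scores are illustrative assumptions, not the paper's implementation; the authors' actual SEMQ code is in the linked repository.

```python
# Illustrative sketch only: Z-score outlier isolation plus an iterative
# (Lloyd-Max / weighted k-means style) non-uniform codebook fit.
# Names, thresholds, and the "sensitivity" weights are assumptions.
import numpy as np

def split_outliers(w, z_thresh=3.0):
    """Mark weights whose Z-score exceeds z_thresh (normality assumption)."""
    mu, sigma = w.mean(), w.std() + 1e-12
    z = np.abs((w - mu) / sigma)
    return z > z_thresh            # True for outliers kept at higher precision

def fit_codebook(w, sens, n_levels=4, iters=50):
    """Iteratively refine n_levels centroids to reduce sensitivity-weighted error."""
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = np.quantile(w, np.linspace(0.0, 1.0, n_levels))
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_levels):
            sel = idx == k
            if sel.any():
                # Weighted mean: more "sensitive" weights pull the centroid harder.
                centroids[k] = np.average(w[sel], weights=sens[sel])
    # Final assignment with the refined centroids.
    idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=4096)          # toy weight tensor
    sens = rng.uniform(0.5, 1.5, size=w.shape)    # placeholder sensitivity scores
    outlier_mask = split_outliers(w)
    inliers = ~outlier_mask
    codebook, assign = fit_codebook(w[inliers], sens[inliers], n_levels=4)  # 2-bit => 4 levels
    w_q = w.copy()
    w_q[inliers] = codebook[assign]               # outliers kept at full precision
    mse = np.mean((w - w_q) ** 2)
    print(f"levels={codebook}, mse={mse:.3e}, outliers={outlier_mask.sum()}")
```

In this toy setting, a 2-bit codebook corresponds to four non-uniform levels; keeping the few Z-score outliers at full precision is what lets the remaining weights be fitted well under a normality assumption.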
Journal overview:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.