{"title":"Efficient Compressing and Tuning Methods for Large Language Models: A Systematic Literature Review","authors":"Gun Il Kim, Sunga Hwang, Beakcheol Jang","doi":"10.1145/3728636","DOIUrl":null,"url":null,"abstract":"Efficient compression and tuning techniques have become indispensable in addressing the increasing computational and memory demands of large language models (LLMs). While these models have demonstrated exceptional performance across a wide range of natural language processing tasks, their growing size and resource requirements pose significant challenges to accessibility and sustainability. This survey systematically reviews state-of-the-art methods in model compression, including compression techniques such as knowledge distillation, low-rank approximation, parameter pruning, and quantization, as well as tuning techniques such as parameter-efficient fine-tuning and inference optimization. Compression techniques, though well-established in traditional deep learning, require updated methodologies tailored to the scale and dynamics of LLMs. Simultaneously, parameter-efficient fine-tuning, exemplified by techniques like Low-Rank Adaptation (LoRA) and query tuning, emerges as a promising solution for adapting models with minimal resource overhead. This study provides a detailed taxonomy of these methods, examining their practical applications, strengths, and limitations. Critical gaps are identified in scalability, and the integration of compression and tuning strategies, signaling the need for unified frameworks and hybrid approaches to maximize efficiency and performance. By addressing these challenges, this survey aims to guide researchers toward sustainable, efficient, and accessible LLM development, ensuring their broader applicability across diverse domains while mitigating resource constraints.","PeriodicalId":50926,"journal":{"name":"ACM Computing Surveys","volume":"4 1","pages":""},"PeriodicalIF":23.8000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Computing Surveys","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3728636","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0
Abstract
Efficient compression and tuning techniques have become indispensable in addressing the increasing computational and memory demands of large language models (LLMs). While these models have demonstrated exceptional performance across a wide range of natural language processing tasks, their growing size and resource requirements pose significant challenges to accessibility and sustainability. This survey systematically reviews state-of-the-art methods for efficient LLMs, covering compression techniques such as knowledge distillation, low-rank approximation, parameter pruning, and quantization, as well as tuning techniques such as parameter-efficient fine-tuning and inference optimization. Compression techniques, though well established in traditional deep learning, require updated methodologies tailored to the scale and dynamics of LLMs. At the same time, parameter-efficient fine-tuning, exemplified by techniques such as Low-Rank Adaptation (LoRA) and query tuning, emerges as a promising solution for adapting models with minimal resource overhead. This study provides a detailed taxonomy of these methods, examining their practical applications, strengths, and limitations. Critical gaps are identified in scalability and in the integration of compression and tuning strategies, signaling the need for unified frameworks and hybrid approaches to maximize efficiency and performance. By addressing these challenges, this survey aims to guide researchers toward sustainable, efficient, and accessible LLM development, ensuring broader applicability across diverse domains while mitigating resource constraints.
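To make the parameter-efficient fine-tuning family named in the abstract concrete, the sketch below illustrates the core idea behind LoRA: the pretrained weight matrix is frozen, and only a low-rank correction B·A (scaled by alpha/r) is trained. This is a minimal, hypothetical PyTorch example, not code from the surveyed paper; the class name LoRALinear and the values r=8 and alpha=16 are illustrative assumptions.

```python
# Minimal LoRA sketch (illustrative only; not the paper's reference implementation).
# Assumes PyTorch is installed and the frozen base layer is a plain nn.Linear.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        # Low-rank factors: A projects down to rank r, B projects back up (zero-initialized).
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the scaled low-rank correction; only A and B get gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16.0)
    out = layer(torch.randn(2, 768))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(out.shape, f"trainable: {trainable}/{total}")
```

Because only A and B are updated, the trainable parameter count drops from in_features x out_features to r x (in_features + out_features), which is the source of the minimal resource overhead the abstract attributes to this class of methods.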
About the Journal
ACM Computing Surveys is an academic journal that focuses on publishing surveys and tutorials on various areas of computing research and practice. The journal aims to provide comprehensive and easily understandable articles that guide readers through the literature and help them understand topics outside their specialties. In terms of impact, CSUR has a high reputation with a 2022 Impact Factor of 16.6. It is ranked 3rd out of 111 journals in the field of Computer Science Theory & Methods.
ACM Computing Surveys is indexed and abstracted in various services, including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec, among others.