{"title":"LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning","authors":"Kailash Gogineni;Ali Suvizi;Guru Venkataramani","doi":"10.1109/OJCS.2025.3580498","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have shown remarkable capabilities in various applications, including robotics, telecommunications, and scientific discovery. While much attention has been given to LLM inference and training phases, fine-tuning has received less focus despite its increasing cost, especially from a systems perspective. Fine-tuning is especially important for customizing compact models for edge applications, such as personal assistants running on local devices and models personalized with user-specific data, which in turn requires a deeper examination of fine-tuning performance and efficiency on single-GPU systems. Fine-tuning large models involves intensive matrix operations from backpropagation and gradient updates, which require extensive power and memory usage. In order to explore the range of performance optimization opportunities available to improve the LLM fine-tuning runtime, we understand the impact of techniques like activation checkpointing, low-rank adaptation, and operation fusion on LLM fine-tuning runtime optimization. In addition, we explore the effects of resource utilization through GPU peak power capping. Our experiments, conducted on NVIDIA RTX 4090 GPU using Meta’s LLaMA-3.1, Google’s Gemma, and Microsoft’s Phi-3, reveal that enabling all optimizations reduces memory usage by over 40% compared to FP32 baselines. Moreover, power capping to 300 W results in an average throughput drop of only 5.55% while reducing power consumption by 33%. Post-fine-tuning accuracy improvements on the Sycophancy Evaluation Benchmark range from 2% to 5%, depending on model architecture, validating that our optimization techniques preserve model quality while reducing resource requirements. Furthermore, we discuss several insights and potential future research directions from a systems perspective.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"6 ","pages":"987-1000"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11037824","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11037824/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in various applications, including robotics, telecommunications, and scientific discovery. While much attention has been given to LLM inference and training, fine-tuning has received less focus despite its increasing cost, especially from a systems perspective. Fine-tuning is particularly important for customizing compact models for edge applications, such as personal assistants running on local devices and models personalized with user-specific data, which in turn calls for a deeper examination of fine-tuning performance and efficiency on single-GPU systems. Fine-tuning large models involves intensive matrix operations from backpropagation and gradient updates, which demand substantial power and memory. To explore the performance optimization opportunities available for improving LLM fine-tuning runtime, we analyze the impact of techniques such as activation checkpointing, low-rank adaptation, and operation fusion. In addition, we explore the effect of GPU peak power capping on resource utilization. Our experiments, conducted on an NVIDIA RTX 4090 GPU using Meta’s LLaMA-3.1, Google’s Gemma, and Microsoft’s Phi-3, reveal that enabling all optimizations reduces memory usage by over 40% compared to FP32 baselines. Moreover, capping power at 300 W results in an average throughput drop of only 5.55% while reducing power consumption by 33%. Post-fine-tuning accuracy improvements on the Sycophancy Evaluation Benchmark range from 2% to 5%, depending on model architecture, validating that our optimization techniques preserve model quality while reducing resource requirements. Furthermore, we discuss several insights and potential future research directions from a systems perspective.
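The abstract names two memory-oriented fine-tuning optimizations (activation checkpointing and low-rank adaptation) alongside operation fusion and GPU power capping. As a rough illustration of how the first two are commonly enabled for single-GPU fine-tuning, the sketch below uses Hugging Face transformers and peft; the checkpoint identifier and LoRA hyperparameters are assumptions for illustration, not values reported in the paper.

```python
# Illustrative sketch (not the authors' code): activation checkpointing +
# LoRA for single-GPU fine-tuning. Model name and LoRA settings are assumed.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"  # assumed checkpoint identifier
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # mixed precision instead of an FP32 baseline
)

# Activation checkpointing: recompute activations during the backward pass,
# trading extra compute for lower peak memory.
model.gradient_checkpointing_enable()

# Low-rank adaptation (LoRA): train small rank-r update matrices instead of
# full weight matrices, shrinking gradient and optimizer-state memory.
lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Power capping of the kind evaluated in the paper (300 W on the RTX 4090) is typically applied outside the training script, for example via `nvidia-smi` with administrator privileges; the snippet below is one way to script it and reflects common tooling rather than the authors' exact setup.

```python
# Hypothetical helper: cap the board power of GPU 0 to 300 W before launching
# fine-tuning. Requires root/administrator privileges.
import subprocess

subprocess.run(["nvidia-smi", "-i", "0", "-pl", "300"], check=True)
```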