{"title":"LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning","authors":"Kailash Gogineni;Ali Suvizi;Guru Venkataramani","doi":"10.1109/OJCS.2025.3580498","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have shown remarkable capabilities in various applications, including robotics, telecommunications, and scientific discovery. While much attention has been given to LLM inference and training phases, fine-tuning has received less focus despite its increasing cost, especially from a systems perspective. Fine-tuning is especially important for customizing compact models for edge applications, such as personal assistants running on local devices and models personalized with user-specific data, which in turn requires a deeper examination of fine-tuning performance and efficiency on single-GPU systems. Fine-tuning large models involves intensive matrix operations from backpropagation and gradient updates, which require extensive power and memory usage. In order to explore the range of performance optimization opportunities available to improve the LLM fine-tuning runtime, we understand the impact of techniques like activation checkpointing, low-rank adaptation, and operation fusion on LLM fine-tuning runtime optimization. In addition, we explore the effects of resource utilization through GPU peak power capping. Our experiments, conducted on NVIDIA RTX 4090 GPU using Meta’s LLaMA-3.1, Google’s Gemma, and Microsoft’s Phi-3, reveal that enabling all optimizations reduces memory usage by over 40% compared to FP32 baselines. Moreover, power capping to 300 W results in an average throughput drop of only 5.55% while reducing power consumption by 33%. Post-fine-tuning accuracy improvements on the Sycophancy Evaluation Benchmark range from 2% to 5%, depending on model architecture, validating that our optimization techniques preserve model quality while reducing resource requirements. Furthermore, we discuss several insights and potential future research directions from a systems perspective.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"6 ","pages":"987-1000"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11037824","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11037824/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in various applications, including robotics, telecommunications, and scientific discovery. While much attention has been given to LLM inference and training, fine-tuning has received less focus despite its increasing cost, especially from a systems perspective. Fine-tuning is particularly important for customizing compact models for edge applications, such as personal assistants running on local devices and models personalized with user-specific data, which in turn calls for a deeper examination of fine-tuning performance and efficiency on single-GPU systems. Fine-tuning large models involves intensive matrix operations from backpropagation and gradient updates, which demand substantial power and memory. To explore the performance optimization opportunities available for improving LLM fine-tuning runtime, we analyze the impact of techniques such as activation checkpointing, low-rank adaptation, and operation fusion. In addition, we explore the effect of GPU peak power capping on resource utilization. Our experiments, conducted on an NVIDIA RTX 4090 GPU using Meta’s LLaMA-3.1, Google’s Gemma, and Microsoft’s Phi-3, reveal that enabling all optimizations reduces memory usage by over 40% compared to FP32 baselines. Moreover, capping power at 300 W results in an average throughput drop of only 5.55% while reducing power consumption by 33%. Post-fine-tuning accuracy improvements on the Sycophancy Evaluation Benchmark range from 2% to 5%, depending on model architecture, validating that our optimization techniques preserve model quality while reducing resource requirements. Furthermore, we discuss several insights and potential future research directions from a systems perspective.
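The abstract names two memory-oriented fine-tuning optimizations (activation checkpointing and low-rank adaptation) alongside operation fusion and GPU power capping. As a rough illustration of how the first two are commonly enabled for single-GPU fine-tuning, the sketch below uses Hugging Face transformers and peft; the checkpoint identifier and LoRA hyperparameters are assumptions for illustration, not values reported in the paper.

```python
# Illustrative sketch (not the authors' code): activation checkpointing +
# LoRA for single-GPU fine-tuning. Model name and LoRA settings are assumed.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"  # assumed checkpoint identifier
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # mixed precision instead of an FP32 baseline
)

# Activation checkpointing: recompute activations during the backward pass,
# trading extra compute for lower peak memory.
model.gradient_checkpointing_enable()

# Low-rank adaptation (LoRA): train small rank-r update matrices instead of
# full weight matrices, shrinking gradient and optimizer-state memory.
lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Power capping of the kind evaluated in the paper (300 W on the RTX 4090) is typically applied outside the training script, for example via `nvidia-smi` with administrator privileges; the snippet below is one way to script it and reflects common tooling rather than the authors' exact setup.

```python
# Hypothetical helper: cap the board power of GPU 0 to 300 W before launching
# fine-tuning. Requires root/administrator privileges.
import subprocess

subprocess.run(["nvidia-smi", "-i", "0", "-pl", "300"], check=True)
```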