Providing load flexibility by reshaping power profiles of large language model workloads

Impact Factor: 13.8 · Q1, Energy & Fuels
Yi Wang, Qinglai Guo, Min Chen
{"title":"通过重塑大型语言模型工作负载的功率配置文件来提供负载灵活性","authors":"Yi Wang,&nbsp;Qinglai Guo,&nbsp;Min Chen","doi":"10.1016/j.adapen.2025.100232","DOIUrl":null,"url":null,"abstract":"<div><div>The emergence of large language models (LLM) has driven a significant increase of AI workload in data center power demand. Renewable-powered solutions to decarbonizing LLM workload and reducing electricity costs are faced with the combined volatility of stochastic user requests and renewable energy. The key to removing the barriers in sustainable AI development lies in the adjustable capability of LLM power profiles. Therefore, this paper focuses on exploring the potential load flexibility of LLM workload and proposes a coordinated scheduling framework, notably, without computing performance degradation. Driven by the existence of the energy-optimal core frequency for graphics processing units (GPU), the energy-performance decoupling phenomenon is discovered and proved, where collaborative scaling in GPU quantity and frequency can change power but not computing performance. Motivated by this, the framework slows down the fine-tuning cluster and utilizes idle GPU resources from the inference cluster to maintain the computing performance of fine-tuning tasks. Consequently, the power consumption of the total cluster is reduced, which provides a fresh source of load flexibility. Furthermore, the framework employs dynamic frequency scaling to more flexibly modify the power profile of the expanded fine-tuning cluster. The computing performance is particularly guaranteed through temporal coupling constraints. In a simulated study supported by real-world data, the results prove a 6.8% power-saving ability and 11.3% cost-saving gains on average.</div></div>","PeriodicalId":34615,"journal":{"name":"Advances in Applied Energy","volume":"19 ","pages":"Article 100232"},"PeriodicalIF":13.8000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Providing load flexibility by reshaping power profiles of large language model workloads\",\"authors\":\"Yi Wang,&nbsp;Qinglai Guo,&nbsp;Min Chen\",\"doi\":\"10.1016/j.adapen.2025.100232\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The emergence of large language models (LLM) has driven a significant increase of AI workload in data center power demand. Renewable-powered solutions to decarbonizing LLM workload and reducing electricity costs are faced with the combined volatility of stochastic user requests and renewable energy. The key to removing the barriers in sustainable AI development lies in the adjustable capability of LLM power profiles. Therefore, this paper focuses on exploring the potential load flexibility of LLM workload and proposes a coordinated scheduling framework, notably, without computing performance degradation. Driven by the existence of the energy-optimal core frequency for graphics processing units (GPU), the energy-performance decoupling phenomenon is discovered and proved, where collaborative scaling in GPU quantity and frequency can change power but not computing performance. Motivated by this, the framework slows down the fine-tuning cluster and utilizes idle GPU resources from the inference cluster to maintain the computing performance of fine-tuning tasks. Consequently, the power consumption of the total cluster is reduced, which provides a fresh source of load flexibility. 
Furthermore, the framework employs dynamic frequency scaling to more flexibly modify the power profile of the expanded fine-tuning cluster. The computing performance is particularly guaranteed through temporal coupling constraints. In a simulated study supported by real-world data, the results prove a 6.8% power-saving ability and 11.3% cost-saving gains on average.</div></div>\",\"PeriodicalId\":34615,\"journal\":{\"name\":\"Advances in Applied Energy\",\"volume\":\"19 \",\"pages\":\"Article 100232\"},\"PeriodicalIF\":13.8000,\"publicationDate\":\"2025-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in Applied Energy\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666792425000265\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENERGY & FUELS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Applied Energy","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666792425000265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENERGY & FUELS","Score":null,"Total":0}
Citations: 0

Abstract

The emergence of large language models (LLMs) has driven a significant increase in data-center power demand from AI workloads. Renewable-powered solutions for decarbonizing LLM workloads and reducing electricity costs face the combined volatility of stochastic user requests and renewable energy. The key to removing the barriers to sustainable AI development lies in the adjustability of LLM power profiles. This paper therefore explores the potential load flexibility of LLM workloads and proposes a coordinated scheduling framework that, notably, incurs no computing-performance degradation. Building on the existence of an energy-optimal core frequency for graphics processing units (GPUs), an energy-performance decoupling phenomenon is discovered and proved: collaborative scaling of GPU quantity and core frequency can change power consumption without changing computing performance. Motivated by this, the framework slows down the fine-tuning cluster and draws on idle GPU resources from the inference cluster to maintain the computing performance of fine-tuning tasks. The total cluster's power consumption is thereby reduced, providing a fresh source of load flexibility. Furthermore, the framework employs dynamic frequency scaling to modify the power profile of the expanded fine-tuning cluster more flexibly, with computing performance guaranteed through temporal coupling constraints. In a simulated study supported by real-world data, the results demonstrate a 6.8% power saving and an 11.3% cost saving on average.
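The abstract's central claim, that scaling GPU quantity and core frequency together can change power draw without changing computing performance, follows from the existence of an energy-optimal core frequency. The sketch below illustrates the idea numerically under an assumed static-plus-cubic DVFS power model, P(f) = P_static + c·f^3, and linear frequency-throughput scaling; the constants and the model are illustrative assumptions, not values from the paper.

```python
# A minimal numerical sketch of the energy-performance decoupling idea.
# Assumptions (not from the paper): per-GPU power follows a static-plus-cubic
# DVFS model P(f) = P_static + c * f**3, and the throughput of a
# compute-bound job scales linearly with core frequency, T = k * N * f.

P_STATIC = 60.0   # W, static/idle power per GPU (assumed)
C_DYN = 1.5e-7    # W / MHz^3, dynamic power coefficient (assumed)
K_PERF = 1.0      # work units per GPU per MHz per second (assumed)

def gpu_power(f_mhz: float) -> float:
    """Per-GPU power under the assumed static + cubic DVFS model."""
    return P_STATIC + C_DYN * f_mhz ** 3

def cluster_power(n_gpus: float, f_mhz: float) -> float:
    return n_gpus * gpu_power(f_mhz)

# Fix a target throughput: e.g. 8 GPUs running at 1400 MHz.
target = K_PERF * 8 * 1400.0

# Sweep (N, f) pairs with N * f held constant: same throughput, varying power.
for f in [1400.0, 1200.0, 1000.0, 800.0, 600.0]:
    n = target / (K_PERF * f)  # GPUs needed to keep throughput fixed
    print(f"f = {f:6.0f} MHz, N = {n:5.2f} GPUs, "
          f"power = {cluster_power(n, f):8.1f} W")

# Under this model the energy-optimal frequency solves d/df (P(f)/f) = 0,
# i.e. f* = (P_static / (2 * c))**(1/3): below f*, static power dominates
# and adding ever-slower GPUs stops paying off.
f_opt = (P_STATIC / (2 * C_DYN)) ** (1 / 3)
print(f"energy-optimal frequency f* ~ {f_opt:.0f} MHz")
```

In this toy model, roughly 18.7 GPUs at 600 MHz draw less than half the power of 8 GPUs at 1400 MHz for the same throughput (fractional GPU counts would be rounded in practice), which mirrors how the framework slows the fine-tuning cluster and backfills with idle inference GPUs. The framework's second lever, dynamic frequency scaling under temporal coupling constraints, can be sketched as a small scheduling problem: choose a frequency per time slot to minimize electricity cost while the slots' combined work still meets the job's total requirement. The exhaustive search below, with assumed prices and the same assumed power model, is a hedged illustration of that constraint, not the paper's scheduler.

```python
# A hedged sketch of power-profile shaping under a temporal coupling
# constraint: pick a core frequency per time slot so the cluster tracks
# electricity prices while the fine-tuning job still finishes its total
# work on time. All numbers are illustrative assumptions.

import itertools

P_STATIC = 60.0   # W per GPU, static power (assumed)
C_DYN = 1.5e-7    # W / MHz^3, dynamic power coefficient (assumed)
K_PERF = 1.0      # work units per GPU per MHz per hour (assumed)
N_GPUS = 16
SLOT_H = 1.0      # hours per slot

PRICES = [0.30, 0.55, 0.80, 0.40]                # $/kWh per slot (assumed)
FREQS = [600.0, 800.0, 1000.0, 1200.0, 1400.0]   # candidate MHz settings
# Require 90% of the work the cluster could do running flat out at 1000 MHz.
WORK_REQUIRED = K_PERF * N_GPUS * 1000.0 * len(PRICES) * 0.9

def slot_cost(price: float, f: float) -> float:
    """Electricity cost of one slot with all GPUs at frequency f."""
    power_kw = N_GPUS * (P_STATIC + C_DYN * f ** 3) / 1000.0
    return price * power_kw * SLOT_H

best_cost, best_plan = float("inf"), None
for plan in itertools.product(FREQS, repeat=len(PRICES)):
    # Temporal coupling constraint: work summed across slots meets the target.
    work = sum(K_PERF * N_GPUS * f * SLOT_H for f in plan)
    if work < WORK_REQUIRED:
        continue
    cost = sum(slot_cost(p, f) for p, f in zip(PRICES, plan))
    if cost < best_cost:
        best_cost, best_plan = cost, plan

print(f"cheapest feasible plan (MHz per slot): {best_plan}")
print(f"total cost: ${best_cost:.2f}")
```

As expected, the cheapest feasible plan slows the cluster during the expensive slot and speeds it up in the cheap slots, reshaping the power profile while the deadline constraint preserves total computing output.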
Source journal
Advances in Applied Energy (Energy: General Energy)
CiteScore: 23.90
Self-citation rate: 0.00%
Annual articles: 36
Review time: 21 days