TightLLM: Maximizing Throughput for LLM Inference via Adaptive Offloading Policy

IF 3.8 | CAS Tier 2 (Computer Science) | JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Yitao Hu;Xiulong Liu;Guotao Yang;Linxuan Li;Kai Zeng;Zhixin Zhao;Sheng Chen;Laiping Zhao;Wenxin Li;Keqiu Li
{"title":"TightLLM:通过自适应卸载策略最大化LLM推理的吞吐量","authors":"Yitao Hu;Xiulong Liu;Guotao Yang;Linxuan Li;Kai Zeng;Zhixin Zhao;Sheng Chen;Laiping Zhao;Wenxin Li;Keqiu Li","doi":"10.1109/TC.2025.3558009","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, largely due to their substantial model size. However, this also results in significant GPU memory demands during inference. To address these challenges on hardware with limited GPU memory, existing approaches employ offloading techniques that offload unused tensors to CPU memory, thereby reducing GPU memory usage. Since offloading involves data transfer between GPU and CPU, it introduces transfer overhead. To mitigate this, prior works typically overlap data transfer with GPU computation using a fixed pipelining strategy applied uniformly across all inference iterations, referred to as <italic>static</i> offloading. However, static offloading policies fail to maximize inference throughput because they cannot adapt to the dynamically changing transfer overhead during the inference process, leading to increasing GPU idleness and reduced inference throughput. We propose that offloading policies should be <italic>adaptive</i> to the varying transfer overhead across inference iterations to maximize inference throughput. To this end, we design and implement an adaptive offloading-based inference system called TightLLM with two key innovations. First, its key-value (KV) distributor employs a <italic>trade-compute-for-transfer</i> strategy to address growing transfer overhead by dynamically recomputing portions of the KV cache, effectively overlapping data transfer with computation and minimizing GPU idleness. Second, TightLLM's weight loader slices model weights and distributes the loading process <italic>across multiple batches</i>, amortizing the excessive weight loading overhead and significantly improving throughput. Evaluation across various combinations of GPU hardware and LLM models shows that TightLLM achieves 1.3 to 23 times higher throughput during the decoding phase and 1.2 to 22 times higher throughput in the prefill phase compared to state-of-the-art offloading systems. Due to the higher throughput in prefill and decoding phases, TightLLM can reduce the completion time for large-scale tasks, which involve processing and generating a substantial number of tokens, by 59.6% to 94.9%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2195-2209"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TightLLM: Maximizing Throughput for LLM Inference via Adaptive Offloading Policy\",\"authors\":\"Yitao Hu;Xiulong Liu;Guotao Yang;Linxuan Li;Kai Zeng;Zhixin Zhao;Sheng Chen;Laiping Zhao;Wenxin Li;Keqiu Li\",\"doi\":\"10.1109/TC.2025.3558009\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, largely due to their substantial model size. However, this also results in significant GPU memory demands during inference. To address these challenges on hardware with limited GPU memory, existing approaches employ offloading techniques that offload unused tensors to CPU memory, thereby reducing GPU memory usage. 
Since offloading involves data transfer between GPU and CPU, it introduces transfer overhead. To mitigate this, prior works typically overlap data transfer with GPU computation using a fixed pipelining strategy applied uniformly across all inference iterations, referred to as <italic>static</i> offloading. However, static offloading policies fail to maximize inference throughput because they cannot adapt to the dynamically changing transfer overhead during the inference process, leading to increasing GPU idleness and reduced inference throughput. We propose that offloading policies should be <italic>adaptive</i> to the varying transfer overhead across inference iterations to maximize inference throughput. To this end, we design and implement an adaptive offloading-based inference system called TightLLM with two key innovations. First, its key-value (KV) distributor employs a <italic>trade-compute-for-transfer</i> strategy to address growing transfer overhead by dynamically recomputing portions of the KV cache, effectively overlapping data transfer with computation and minimizing GPU idleness. Second, TightLLM's weight loader slices model weights and distributes the loading process <italic>across multiple batches</i>, amortizing the excessive weight loading overhead and significantly improving throughput. Evaluation across various combinations of GPU hardware and LLM models shows that TightLLM achieves 1.3 to 23 times higher throughput during the decoding phase and 1.2 to 22 times higher throughput in the prefill phase compared to state-of-the-art offloading systems. Due to the higher throughput in prefill and decoding phases, TightLLM can reduce the completion time for large-scale tasks, which involve processing and generating a substantial number of tokens, by 59.6% to 94.9%.\",\"PeriodicalId\":13087,\"journal\":{\"name\":\"IEEE Transactions on Computers\",\"volume\":\"74 7\",\"pages\":\"2195-2209\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Computers\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10949701/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10949701/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, largely due to their substantial model size. However, this also results in significant GPU memory demands during inference. To address these challenges on hardware with limited GPU memory, existing approaches employ offloading techniques that offload unused tensors to CPU memory, thereby reducing GPU memory usage. Since offloading involves data transfer between GPU and CPU, it introduces transfer overhead. To mitigate this, prior works typically overlap data transfer with GPU computation using a fixed pipelining strategy applied uniformly across all inference iterations, referred to as static offloading. However, static offloading policies fail to maximize inference throughput because they cannot adapt to the dynamically changing transfer overhead during the inference process, leading to increasing GPU idleness and reduced inference throughput. We propose that offloading policies should be adaptive to the varying transfer overhead across inference iterations to maximize inference throughput. To this end, we design and implement an adaptive offloading-based inference system called TightLLM with two key innovations. First, its key-value (KV) distributor employs a trade-compute-for-transfer strategy to address growing transfer overhead by dynamically recomputing portions of the KV cache, effectively overlapping data transfer with computation and minimizing GPU idleness. Second, TightLLM's weight loader slices model weights and distributes the loading process across multiple batches, amortizing the excessive weight loading overhead and significantly improving throughput. Evaluation across various combinations of GPU hardware and LLM models shows that TightLLM achieves 1.3 to 23 times higher throughput during the decoding phase and 1.2 to 22 times higher throughput in the prefill phase compared to state-of-the-art offloading systems. Due to the higher throughput in prefill and decoding phases, TightLLM can reduce the completion time for large-scale tasks, which involve processing and generating a substantial number of tokens, by 59.6% to 94.9%.
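The abstract describes TightLLM's mechanisms only at a high level. As a rough illustration of the general pattern behind the first one, the PyTorch sketch below overlaps a KV-cache reload with GPU computation on a separate CUDA stream, and falls back to recomputing the block on the GPU when the estimated transfer time would exceed the available compute window (the trade-compute-for-transfer idea). This is a minimal sketch under assumed conditions, not TightLLM's implementation: the names (`fetch_kv`, `estimate_transfer_ms`, `recompute_kv`), the tensor shapes, the assumed PCIe bandwidth, and the compute-budget heuristic are all hypothetical.

```python
# Illustrative sketch only (not TightLLM's actual code): overlap a KV-cache
# reload with computation via a side CUDA stream, or recompute the block when
# the estimated transfer would outlast the available compute window.
import torch

assert torch.cuda.is_available(), "this sketch requires a CUDA-capable GPU"

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # dedicated stream for CPU<->GPU copies

# Hypothetical per-layer KV block kept in pinned CPU memory so async copies work.
kv_cpu = torch.randn(4096, 128).to(torch.float16).pin_memory()


def estimate_transfer_ms(t: torch.Tensor, pcie_gbps: float = 16.0) -> float:
    """Rough transfer-time estimate from tensor size and an assumed PCIe bandwidth."""
    return t.numel() * t.element_size() / (pcie_gbps * 1e9) * 1e3


def recompute_kv(hidden: torch.Tensor, w_k: torch.Tensor) -> torch.Tensor:
    """Stand-in for rebuilding a KV block from activations (trade compute for transfer)."""
    return hidden @ w_k


def fetch_kv(hidden: torch.Tensor, w_k: torch.Tensor, compute_budget_ms: float) -> torch.Tensor:
    """Return the KV block either by asynchronous reload or by on-GPU recomputation."""
    if estimate_transfer_ms(kv_cpu) <= compute_budget_ms:
        # Reload path: issue the copy on the side stream so it can overlap
        # with whatever is still running on the default stream.
        with torch.cuda.stream(copy_stream):
            kv_gpu = kv_cpu.to(device, non_blocking=True)
        # Make the default stream wait for the copy before the block is consumed.
        torch.cuda.current_stream().wait_stream(copy_stream)
        return kv_gpu
    # Recompute path: the transfer would stall compute, so rebuild on the GPU.
    return recompute_kv(hidden, w_k)


# Toy usage: assume the current layer leaves a 0.5 ms window to hide the copy.
hidden = torch.randn(4096, 512, dtype=torch.float16, device=device)
w_k = torch.randn(512, 128, dtype=torch.float16, device=device)
kv = fetch_kv(hidden, w_k, compute_budget_ms=0.5)
```

In this toy version the decision is a one-shot heuristic per block; the adaptive policy the abstract describes tracks how the transfer overhead changes across inference iterations (and also amortizes weight loading across batches), which this sketch does not model.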
Source journal
IEEE Transactions on Computers (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 6.60
Self-citation rate: 5.40%
Articles per year: 199
Review time: 6.0 months
Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.