{"title":"FlashDecoding++Next: High Throughput LLM Inference With Latency and Memory Optimization","authors":"Guohao Dai;Ke Hong;Qiuli Mao;Xiuhong Li;Jiaming Xu;Haofeng Huang;Hongtu Xia;Xuefei Ning;Shengen Yan;Yun Liang;Yu Wang","doi":"10.1109/TC.2025.3585339","DOIUrl":null,"url":null,"abstract":"As the Large Language Model (LLM) becomes increasingly important in various domains, the performance of LLM inference is crucial to massive LLM applications. However, centering around the computational efficiency and the memory utilization, the following challenges remain unsolved in achieving high-throughput LLM inference: (1) Synchronous partial softmax update. The softmax operation requires a synchronous update operation among each partial softmax result, leading to <inline-formula><tex-math>$\\sim$</tex-math></inline-formula>20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference tends to be flat, leading to under-utilized computation and 50% performance loss after padding zeros in previous designs (<i>e.g.,</i> cuBLAS, CUTLASS, etc.). (3) Memory redundancy caused by activations. Dynamic allocation of activations during inference leads to redundant storage of useless variables, bringing 22% more memory consumption. We present <i>FlashDecoding++Next</i>, a high-throughput inference engine supporting mainstream LLMs and hardware backends. To tackle the above challenges, <i>FlashDecoding++Next</i> creatively proposes: <b>(1) Asynchronous softmax with unified maximum.</b> <i>FlashDecoding++Next</i> introduces a unified maximum technique for different partial softmax computations to avoid synchronization. Based on this, a fine-grained pipelining is proposed, leading to 1.18<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> and 1.14<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> for the <i>prefill</i> and decode phases in LLM inference, respectively. <b>(2) Flat GEMM optimization with double buffering.</b> <i>FlashDecoding++Next</i> points out that flat GEMMs with different shapes face varied bottlenecks. Then, techniques like double buffering are introduced, resulting in up to 52% speedup for the flat GEMM operation. (3) Buffer reusing and unified memory management. <i>FlashDecoding++Next</i> reuses the pre-allocated activation buffers throughout the inference process to remove redundancy. Based on that, we unify the management of different types of storage to further exploit the reusing opportunity. The memory optimization enables up to 1.57<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> longer sequence to be processed. <i>FlashDecoding++Next</i> demonstrates remarkable throughput improvement, delivering up to <b>68.88</b><inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> higher throughput compared to the HuggingFace <xref>[1]</xref> implementation. 
On average, <i>FlashDecoding++Next</i> achieves <b>1.25</b><inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> and <b>1.46</b><inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> higher throughput compared to vLLM <xref>[2]</xref> and TensorRT-LLM <xref>[3]</xref> on mainstream LLMs.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3263-3276"},"PeriodicalIF":3.8000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11062854/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract
As Large Language Models (LLMs) become increasingly important across domains, inference performance is crucial to large-scale LLM applications. However, the following challenges, centered on computational efficiency and memory utilization, remain unsolved in achieving high-throughput LLM inference: (1) Synchronous partial softmax update. The softmax operation requires a synchronous update among the partial softmax results, leading to ~20% overhead for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The matrices involved in GEMM during LLM inference tend to be flat, leading to under-utilized computation and a 50% performance loss after padding with zeros in previous designs (e.g., cuBLAS, CUTLASS). (3) Memory redundancy caused by activations. Dynamic allocation of activations during inference leads to redundant storage of useless variables, resulting in 22% higher memory consumption. We present FlashDecoding++Next, a high-throughput inference engine supporting mainstream LLMs and hardware backends. To tackle the above challenges, FlashDecoding++Next proposes: (1) Asynchronous softmax with unified maximum. FlashDecoding++Next introduces a unified maximum technique for different partial softmax computations to avoid synchronization. Based on this, a fine-grained pipelining scheme is proposed, yielding 1.18× and 1.14× speedups for the prefill and decode phases of LLM inference, respectively. (2) Flat GEMM optimization with double buffering. FlashDecoding++Next points out that flat GEMMs of different shapes face different bottlenecks. Techniques such as double buffering are then introduced, resulting in up to a 52% speedup for flat GEMM operations. (3) Buffer reusing and unified memory management. FlashDecoding++Next reuses pre-allocated activation buffers throughout the inference process to remove redundancy. Building on that, we unify the management of different types of storage to further exploit reuse opportunities. The memory optimization enables sequences up to 1.57× longer to be processed. FlashDecoding++Next demonstrates remarkable throughput improvement, delivering up to 68.88× higher throughput compared to the HuggingFace [1] implementation. On average, FlashDecoding++Next achieves 1.25× and 1.46× higher throughput compared to vLLM [2] and TensorRT-LLM [3] on mainstream LLMs.
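To make the unified-maximum idea concrete, below is a minimal NumPy sketch (not the paper's CUDA kernels): because exp(x − c) appears in both the numerator and denominator of the softmax, any fixed constant c cancels out, so each partial computation can subtract the same pre-chosen maximum and the partial results can be combined with a single final sum, instead of the synchronous exchange of running maxima and rescaling used in standard online softmax. The chunk count and the value of unified_max here are illustrative assumptions; in practice the constant must be chosen so that exp(score − c) neither overflows nor underflows.

```python
import numpy as np

def partial_exp(chunk, unified_max):
    """Exponentiate one chunk of attention scores against a shared,
    pre-chosen maximum, so no cross-chunk rescaling is required."""
    e = np.exp(chunk - unified_max)
    return e, e.sum()

def softmax_unified_max(scores, num_chunks=4, unified_max=10.0):
    # Split the scores into chunks that could be processed independently
    # (e.g., by different thread blocks); each uses the same maximum.
    chunks = np.array_split(scores, num_chunks)
    exps, sums = zip(*(partial_exp(c, unified_max) for c in chunks))
    denom = sum(sums)                  # a single reduction at the end
    return np.concatenate(exps) / denom

def softmax_reference(scores):
    """Standard softmax with the true maximum, for comparison."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.standard_normal(64)
assert np.allclose(softmax_unified_max(scores), softmax_reference(scores))
```

The sketch only illustrates why a shared maximum removes the need for synchronization among partial results; how the real system selects that maximum and handles out-of-range scores is described in the paper itself.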
Journal Introduction
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.