IEEE Transactions on Computers最新文献_第5页

A High-Efficiency Parallel Mechanism for Canonical Polyadic Decomposition on Heterogeneous Computing Platform 异构计算平台上标准多元分解的高效并行机制

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-10 DOI: 10.1109/TC.2025.3587623

Xiaosong Peng;Laurence T. Yang;Xiaokang Wang;Debin Liu;Jie Li

{"title":"A High-Efficiency Parallel Mechanism for Canonical Polyadic Decomposition on Heterogeneous Computing Platform","authors":"Xiaosong Peng;Laurence T. Yang;Xiaokang Wang;Debin Liu;Jie Li","doi":"10.1109/TC.2025.3587623","DOIUrl":"https://doi.org/10.1109/TC.2025.3587623","url":null,"abstract":"Canonical Polyadic decomposition (CPD) obtains the low-rank approximation for high-order multidimensional tensors through the summation of a sequence of rank-one tensors, greatly reducing storage and computation overhead. It is increasingly being used in the lightweight design of artificial intelligence and big data processing. The existing CPD technology exhibits inherent limitations in simultaneously achieving high accuracy and high efficiency. In this paper, a heterogeneous computing method for CPD is proposed to optimize computing efficiency with guaranteed convergence accuracy. Specifically, a quasi-convex decomposition loss function is constructed and the extreme points of the Kruskal matrix rows have been solved. Further, the massively parallelized operators in the algorithm are extracted, a software-hardware integrated scheduling method is designed, and the deployment of CPD on heterogeneous computing platforms is achieved. Finally, the memory access strategy is optimized to improve memory access efficiency. We tested the algorithm on real-world and synthetic sparse tensor datasets, numerical experimental results show that compared with the state-of-the-art method, the proposed method has a higher convergence accuracy and computing efficiency. Compared to the standard CPD parallel library, the method achieves efficiency improvements of tens to hundreds of times while maintaining the same accuracy.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3377-3389"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Reliable and Efficient Multi-Path Transmission Based on Disjoint Paths in Data Center Networks 数据中心网络中基于不相交路径的可靠高效多径传输

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-10 DOI: 10.1109/TC.2025.3587618

Weibei Fan;Yao Pan;Fu Xiao;Mengjie Lv;Lei Han;Shui Yu

{"title":"Reliable and Efficient Multi-Path Transmission Based on Disjoint Paths in Data Center Networks","authors":"Weibei Fan;Yao Pan;Fu Xiao;Mengjie Lv;Lei Han;Shui Yu","doi":"10.1109/TC.2025.3587618","DOIUrl":"https://doi.org/10.1109/TC.2025.3587618","url":null,"abstract":"Multi-path transmission enables load balancing and improves network performance in data center networks (DCNs). It increases the possibility of network congestion and makes traditional network traffic engineering methods inefficient due to the uneven distribution of network traffic in data centers. In this paper, we present a reliable and efficient Disjoint paths based Multi-Path Transmission scheme (<italic>DMPT) that selects distributed requests through topology awareness. Firstly, we propose disjoint path construction algorithms through rigorous theoretical proof, aiming at the different transmission requirements of DCNs. Secondly, we offer an optimal solution to the disjoint multi-path selection problem, which is aimed at the trade-off between link load and transmission time. Furthermore, <italic>DMPT can split the flow over multiple transmission paths based on the link status. Finally, extensive experiments are executed for <italic>DMPT on a novel EHDC of DCN that is based on exchanged hypercube. The experimental results show that <italic>DMPT can reduce the average running time by 18.6%, and the average path length is close to the optimal path. Furthermore, it achieves significant improvements in balancing network link traffic and facilitating deployment, which also reflects the advantages of topology aware multiplexing in practice.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3362-3376"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145059841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Hybrid Redundancy for Reliable Task Offloading in Collaborative Edge Computing 协同边缘计算中可靠任务卸载的混合冗余

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-10 DOI: 10.1109/TC.2025.3587620

Hao Guo;Lei Yang;Qingfeng Zhang;Jiannong Cao

{"title":"Hybrid Redundancy for Reliable Task Offloading in Collaborative Edge Computing","authors":"Hao Guo;Lei Yang;Qingfeng Zhang;Jiannong Cao","doi":"10.1109/TC.2025.3587620","DOIUrl":"https://doi.org/10.1109/TC.2025.3587620","url":null,"abstract":"Collaborative edge computing enables task execution on the computing resources of geo-distributed edge nodes. One of the key challenges in this field is to realize reliable task offloading by deciding whether to execute tasks locally or delegate them to neighboring nodes while ensuring task reliability. Achieving reliable task offloading is essential for preventing task failures and maintaining optimal system performance. Existing works commonly rely on task redundancy strategies, such as active or passive redundancy. However, these approaches lack adaptive redundancy mechanisms to respond to changes in the network environment, potentially resulting in resource wastage from excessive redundancy or task failures due to insufficient redundancy. In this work, we introduce a novel approach called Hybrid Redundancy for Task Offloading (HRTO) to optimize task latency and reliability. Specifically, HRTO utilizes deep reinforcement learning (DRL) to learn a task offloading policy that maximizes task success rates. With this policy, edge nodes dynamically adjust task redundancy levels based on real-time network load conditions and meanwhile assess whether the task instance is necessary for re-execution in case of task failure. Extensive experiments on real-world network topologies and a Kubernetes-based testbed evaluate the effectiveness of HRTO, showing a 14.6% increase in success rate over the benchmarks.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 9","pages":"3238-3250"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144814149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Automatic Generation of System-Level Test for Un-Core Logic of Large Automotive SoC 大型汽车SoC非核逻辑系统级测试自动生成

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-10 DOI: 10.1109/TC.2025.3587515

Francesco Angione;Paolo Bernardi;Giusy Iaria;Claudia Bertani;Vincenzo Tancorre

引用次数: 0

Scavenger+: Revisiting Space-Time Tradeoffs in Key-Value Separated LSM-Trees 清道夫+：重访键值分离lsm树中的时空权衡

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-10 DOI: 10.1109/TC.2025.3587513

Jianshun Zhang;Fang Wang;Jiaxin Ou;Yi Wang;Ming Zhao;Sheng Qiu;Junxun Huang;Baoquan Li;Peng Fang;Dan Feng

{"title":"Scavenger+: Revisiting Space-Time Tradeoffs in Key-Value Separated LSM-Trees","authors":"Jianshun Zhang;Fang Wang;Jiaxin Ou;Yi Wang;Ming Zhao;Sheng Qiu;Junxun Huang;Baoquan Li;Peng Fang;Dan Feng","doi":"10.1109/TC.2025.3587513","DOIUrl":"https://doi.org/10.1109/TC.2025.3587513","url":null,"abstract":"Key-Value Stores (KVS) based on log-structured merge-trees (LSM-trees) are widely used in storage systems but face significant challenges, such as high write amplification caused by compaction. KV-separated LSM-trees address write amplification but introduce significant space amplification, a critical concern in cost-sensitive scenarios. Garbage collection (GC) can reduce space amplification, but existing strategies are often inefficient and fail to account for workload characteristics. Moreover, current key-value (KV) separated LSM-trees overlook the space amplification caused by the index LSM-tree. In this paper, we systematically analyze the sources of space amplification in KV-separated LSM-trees and propose Scavenger+, which achieves a better performance-space trade-off. Scavenger+ introduces (1) an I/O-efficient garbage collection scheme to reduce I/O overhead, (2) a space-aware compaction strategy based on compensated size to mitigate index-induced space amplification, and (3) a dynamic GC scheduler that adapts to system load to make better use of CPU and storage resources. Extensive experiments demonstrate that Scavenger+ significantly improves write performance and reduces space amplification compared to state-of-the-art KV-separated LSM-trees, including BlobDB, Titan, and TerarkDB.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3332-3346"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards a Unified Framework for Modeling and Analyzing User-Defined Online Non-Preemptive Scheduling Policies 基于统一框架的自定义在线非抢占调度策略建模与分析

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-10 DOI: 10.1109/TC.2025.3587514

Pourya Gohari;Jeroen Voeten;Mitra Nasri

{"title":"Towards a Unified Framework for Modeling and Analyzing User-Defined Online Non-Preemptive Scheduling Policies","authors":"Pourya Gohari;Jeroen Voeten;Mitra Nasri","doi":"10.1109/TC.2025.3587514","DOIUrl":"https://doi.org/10.1109/TC.2025.3587514","url":null,"abstract":"This paper presents a unified formal framework, called ReTA, that allows users to define <italic>scheduling problems using a user-friendly domain-specific language (DSL) and automatically obtain response times of jobs in return. ReTA supports user-defined online scheduling policies (beyond work-conserving or priority-based scheduling) for heterogeneous computing resource types with multiple instances per type (e.g., multiple CPU cores, GPUs, DSPs, and FPGAs on one single chip), thus supporting global, partitioned, and clustered scheduling. In the current version of ReTA, we focus on non-preemptive periodic tasks as these are susceptible to scheduling anomalies and hence harder to analyze. ReTA performs response-time analysis by constructing a <italic>timed labeled transition system (TLTS) from the domain model as a basis for performing a reachability analysis enriched with efficient state-space reduction techniques. Our empirical evaluations show that ReTA identifies up to <italic>50 times more schedulable task sets than fixed-point iteration-based analyses. With a runtime on the order of a few minutes, ReTA produces highly accurate results <italic>two-orders of magnitude faster than an exact Timed Automata-based analysis in UPPAAL (e.g., for systems with 16 cores and 32 tasks).","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3347-3361"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Trajectory Optimization and Power Allocation for Multi-UAV Wireless Networks: A Communication-Based Multi-Agent Deep Reinforcement Learning Approach 多无人机无线网络的轨迹优化与功率分配：基于通信的多智能体深度强化学习方法

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-10 DOI: 10.1109/TC.2025.3587976

Zimeng Yuan;Yuanguo Bi;Yanbo Fan;Yuheng Liu;Lianbo Ma;Liang Zhao;Qiang He

{"title":"Trajectory Optimization and Power Allocation for Multi-UAV Wireless Networks: A Communication-Based Multi-Agent Deep Reinforcement Learning Approach","authors":"Zimeng Yuan;Yuanguo Bi;Yanbo Fan;Yuheng Liu;Lianbo Ma;Liang Zhao;Qiang He","doi":"10.1109/TC.2025.3587976","DOIUrl":"https://doi.org/10.1109/TC.2025.3587976","url":null,"abstract":"Uncrewed Aerial Vehicles (UAVs) play a crucial role in next-generation mobile communication systems, serving as aerial base stations to provide services when ground base stations fail to meet coverage requirements. However, trajectory planning and power allocation for collaborative UAVs as Aerial Base Stations (UAV-ABSs) face several challenges, including energy limitations, flight time constraints, high optimization complexity due to dynamic environment interactions, and insufficient decision-making information. To address these challenges, this paper proposes a multi-agent reinforcement learning algorithm, namely Communication Actor Centralized Attention Critic Algorithm (CATEN), to jointly optimize the flight trajectory and power allocation strategies of UAV-ABSs. The proposed algorithm aims to maximize the number of users meeting Quality of Service (QoS) requirements while minimizing UAV-ABSs energy consumption. To achieve this, firstly, an information sharing mechanism is designed to improve the collaboration efficiency among UAV-ABSs. It leverages distributed storage, intelligent scheduling of UAV-ABSs interaction experiences, and gating units to enhance information screening and fusion. Secondly, a multi-head attention critic network is proposed to capture correlations among UAV-ABSs from different subspaces. This allows the network to prioritize value information, reduce redundancy, and strengthen UAV-ABSs collaboration and decision-making capabilities. Simulation results demonstrate that CATEN achieves better performance in terms of the number of served users and energy consumption compared to existing algorithms, exhibiting good robustness and adaptability in dynamic environments.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3404-3418"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

UKFaaS: Lightweight, High-Performance and Secure FaaS Communication With Unikernel UKFaaS：使用Unikernel的轻量级、高性能和安全的FaaS通信

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-04 DOI: 10.1109/TC.2025.3586031

Zhenqian Chen;Yuchun Zhan;Peng Hu;Xinkui Zhao;Muyu Yang;Siwei Tan;Lufei Zhang;Liqiang Lu;Jianwei Yin;Zuoning Chen

{"title":"UKFaaS: Lightweight, High-Performance and Secure FaaS Communication With Unikernel","authors":"Zhenqian Chen;Yuchun Zhan;Peng Hu;Xinkui Zhao;Muyu Yang;Siwei Tan;Lufei Zhang;Liqiang Lu;Jianwei Yin;Zuoning Chen","doi":"10.1109/TC.2025.3586031","DOIUrl":"https://doi.org/10.1109/TC.2025.3586031","url":null,"abstract":"Unikernel is a promising runtime for serverless computing with its lightweight and isolated architecture. It offers a secure and efficient environment for applications. However, famous serverless frameworks like Knative have introduced heavyweight component sidecars to assist function instance deployment in a non-intrusive manner. But the sidecar not only hinders the throughput of unikernel function services but also consumes excessive memory resources. Moreover, the intricate network communication pathways among various services pose significant challenges for deploying unikernels in production serverless environments. Although shared-memory based communication on the same server can solve the communication bottleneck of unikernel-based function instances. The situation where malicious programs on the server make the shared memory untrustworthy limits the deployment of such technologies. We propose UKFaaS, a lightweight and high-performance serverless framework. UKFaaS leverages the advantages of customized operating systems through unikernel and it non-intrusively integrates sidecar functionality into the unikernel, avoiding the overhead of sidecar request forwarding. Additionally, UKFaaS innovatively implements data communication between unikernels in the same server to eliminate VM-Exit bottlenecks in RPC (remote process call) based on VMFUNC without relying on memory sharing. The preliminary experimental results indicate that UKFaaS can realize <inline-formula><tex-math>$1.8boldsymbol{times}$</tex-math></inline-formula>-<inline-formula><tex-math>$3.5boldsymbol{times}$</tex-math></inline-formula> request throughput per second (RPS) compared with the advanced serverless system FaasFlow, UaaF and Nightcore in the Google online boutique microservice benchmark.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3305-3318"},"PeriodicalIF":3.8,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145059832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

COFFA: A Co-Design Framework for Fused-Grained Reconfigurable Architecture Towards Efficient Irregular Loop Handling 面向高效不规则循环处理的融合粒度可重构架构协同设计框架

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-02 DOI: 10.1109/TC.2025.3585345

Yuan Dai;Xuchen Gao;Yunhui Qiu;Jingyuan Li;Yuhang Cao;Yiqing Mao;Sichao Chen;Wenbo Yin;Wai-Shing Luk;Lingli Wang

{"title":"COFFA: A Co-Design Framework for Fused-Grained Reconfigurable Architecture Towards Efficient Irregular Loop Handling","authors":"Yuan Dai;Xuchen Gao;Yunhui Qiu;Jingyuan Li;Yuhang Cao;Yiqing Mao;Sichao Chen;Wenbo Yin;Wai-Shing Luk;Lingli Wang","doi":"10.1109/TC.2025.3585345","DOIUrl":"https://doi.org/10.1109/TC.2025.3585345","url":null,"abstract":"Coarse-Grained Reconfigurable Architecture (CGRA) emerges as a competitive accelerator due to its high flexibility and energy efficiency. However, most CGRAs are effective for computation-intensive applications with regular loops but struggle with irregular loops containing control flows. These loops introduce fine-grained logic operations and are costly to execute by coarse-grained arithmetic units in CGRA. Efficiently handling such logic operations necessitates incorporating Boolean algebra optimization, which can improve logic density and reduce logic depth. Unfortunately, no previous research has incorporated it into the compilation flow to support irregular loops efficiently. We propose COFFA, an open-source framework for heterogeneous architecture with a RISC-V CPU and a fused-grained reconfigurable accelerator, which integrates coarse-grained arithmetic and fine-grained logic units, along with flexible IO units and distributed interconnects. As a software/hardware co-design framework, COFFA has a powerful compiler that extracts and optimizes fine-grained logic operations from irregular loops, performs coarse-grained arithmetic and memory optimizations, and offloads the loops to the accelerator. Across various challenging benchmarks with irregular loops, COFFA achieves significant performance and energy efficiency improvements over an in-order, an out-of-order RISC-V CPUs, and a recent FPGA, respectively. Moreover, compared with the state-of-the-art CGRA UE-CGRA and Hycube, COFFA can achieve 2.5<inline-formula><tex-math>$times$</tex-math></inline-formula> and 3.5<inline-formula><tex-math>$times$</tex-math></inline-formula> performance gains, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 9","pages":"3099-3113"},"PeriodicalIF":3.8,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FlashDecoding++Next: High Throughput LLM Inference With Latency and Memory Optimization FlashDecoding++Next：高吞吐量LLM推理与延迟和内存优化

IF 3.8 2区计算机科学

IEEE Transactions on Computers Pub Date : 2025-07-02 DOI: 10.1109/TC.2025.3585339

Guohao Dai;Ke Hong;Qiuli Mao;Xiuhong Li;Jiaming Xu;Haofeng Huang;Hongtu Xia;Xuefei Ning;Shengen Yan;Yun Liang;Yu Wang

{"title":"FlashDecoding++Next: High Throughput LLM Inference With Latency and Memory Optimization","authors":"Guohao Dai;Ke Hong;Qiuli Mao;Xiuhong Li;Jiaming Xu;Haofeng Huang;Hongtu Xia;Xuefei Ning;Shengen Yan;Yun Liang;Yu Wang","doi":"10.1109/TC.2025.3585339","DOIUrl":"https://doi.org/10.1109/TC.2025.3585339","url":null,"abstract":"As the Large Language Model (LLM) becomes increasingly important in various domains, the performance of LLM inference is crucial to massive LLM applications. However, centering around the computational efficiency and the memory utilization, the following challenges remain unsolved in achieving high-throughput LLM inference: (1) Synchronous partial softmax update. The softmax operation requires a synchronous update operation among each partial softmax result, leading to <inline-formula><tex-math>$sim$</tex-math></inline-formula>20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference tends to be flat, leading to under-utilized computation and 50% performance loss after padding zeros in previous designs (e.g., cuBLAS, CUTLASS, etc.). (3) Memory redundancy caused by activations. Dynamic allocation of activations during inference leads to redundant storage of useless variables, bringing 22% more memory consumption. We present FlashDecoding++Next, a high-throughput inference engine supporting mainstream LLMs and hardware backends. To tackle the above challenges, FlashDecoding++Next creatively proposes: (1) Asynchronous softmax with unified maximum. FlashDecoding++Next introduces a unified maximum technique for different partial softmax computations to avoid synchronization. Based on this, a fine-grained pipelining is proposed, leading to 1.18<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula> and 1.14<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula> for the prefill and decode phases in LLM inference, respectively. (2) Flat GEMM optimization with double buffering. FlashDecoding++Next points out that flat GEMMs with different shapes face varied bottlenecks. Then, techniques like double buffering are introduced, resulting in up to 52% speedup for the flat GEMM operation. (3) Buffer reusing and unified memory management. FlashDecoding++Next reuses the pre-allocated activation buffers throughout the inference process to remove redundancy. Based on that, we unify the management of different types of storage to further exploit the reusing opportunity. The memory optimization enables up to 1.57<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula> longer sequence to be processed. FlashDecoding++Next demonstrates remarkable throughput improvement, delivering up to 68.88<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula> higher throughput compared to the HuggingFace <xref>[1]</xref> implementation. On average, FlashDecoding++Next achieves 1.25<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula> and 1.46<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula> higher throughput compared to vLLM <xref>[2]</xref> and TensorRT-LLM <xref>[3]</xref> on ma","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3263-3276"},"PeriodicalIF":3.8,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0