Latest Articles in IEEE Transactions on Parallel and Distributed Systems

Mariana: Exploring Native SkipList Index Design for Disaggregated Memory
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-08-07 | DOI: 10.1109/TPDS.2025.3596988
Authors: Xing Wei; Ke Wang; Yinjun Han; Hao Jin; Yaofeng Tu; Huiqi Hu; Xuan Zhou; Minghao Zhao
Abstract: Memory disaggregation has emerged as a promising architecture for improving resource efficiency by decoupling computing and memory resources. But building efficient range indices in such an architecture faces three critical challenges: (1) coarse-grained concurrency-control schemes for coordinating concurrent read/write operations with node splitting incur high contention under skewed, write-intensive workloads; (2) existing data layouts fail to balance consistency verification and hardware acceleration via SIMD (Single Instruction, Multiple Data); and (3) naive caching schemes struggle to adapt to rapidly changing access patterns. To address these challenges, we propose Mariana, a memory-disaggregated skiplist index that integrates three key innovations. First, it uses a fine-grained (i.e., entry-level) latch mechanism combined with dynamic node resizing to minimize contention and splitting frequency. Second, it employs a tailored leaf-node data layout that separates keys and values to enable SIMD acceleration while maintaining consistency checks with minimal write overhead. Third, it implements an adaptive caching strategy that tracks node popularity in real time to optimize network bandwidth utilization during index traversal. Experimental results show that Mariana achieves 1.7× higher throughput under write-intensive workloads and reduces P90 latency by 23% under read-intensive workloads, compared with state-of-the-art indices on disaggregated memory. (Vol. 36, No. 10, pp. 2137-2151)
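The key/value-separated leaf layout can be sketched as follows, with NumPy's vectorized comparison standing in for SIMD; the class and its fields are hypothetical, not Mariana's actual on-memory format:

```python
import numpy as np

class Leaf:
    """Hypothetical leaf node: keys live in one contiguous array so a probe
    can scan the whole key block with a single vectorized (SIMD-friendly)
    comparison; values are fetched by position only after a key hit."""

    def __init__(self, capacity=16):
        # Unused key slots hold the sentinel max value so they never match.
        self.keys = np.full(capacity, np.iinfo(np.int64).max, dtype=np.int64)
        self.vals = []
        self.n = 0

    def insert(self, key, val):
        # Keep keys sorted so range scans stay cheap.
        pos = int(np.searchsorted(self.keys[:self.n], key))
        self.keys[:self.n + 1] = np.insert(self.keys[:self.n], pos, key)
        self.vals.insert(pos, val)
        self.n += 1

    def lookup(self, key):
        # One vectorized comparison over the key block stands in for SIMD.
        hits = np.nonzero(self.keys[:self.n] == key)[0]
        return self.vals[int(hits[0])] if hits.size else None
```

Separating keys from values keeps the comparison stream dense, which is the property the paper exploits for SIMD acceleration.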
Citations: 0
MUCVR: Edge Computing-Enabled High-Quality Multi-User Collaboration for Interactive MVR
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-08-04 | DOI: 10.1109/TPDS.2025.3595801
Authors: Weimin Li; Qin Li; Weihong Tian; Jie Gao; Fan Wu; Jianxun Liu; Ju Ren
Abstract: Mobile Virtual Reality (MVR), which aims to provide high-quality VR services on end users' mobile devices, has become the latest trend in virtual reality development. Current MVR solutions remotely render frame data on a cloud server, leaving the potential of edge computing underexploited. In this paper, we propose MUCVR, a new approach that achieves high-quality interactive MVR collaboration for multiple users by exploiting edge computing. First, we design "vertical" edge-cloud collaboration for VR task rendering, in which foreground interaction is offloaded to an edge server for rendering while the background environment is rendered by the cloud server; the user's VR device is then responsible only for decoding and displaying. Second, we propose "horizontal" multi-user collaboration based on edge-edge cooperation, which synchronizes data among edge servers. Finally, we implement MUCVR on an MVR device and the Unity VR application engine. The results show that MUCVR effectively reduces MVR service latency, improves rendering performance, reduces the computing load on the VR device, and ultimately improves users' quality of experience. (Vol. 36, No. 10, pp. 2058-2072)
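The device-side blend in this vertical split can be pictured as a simple alpha composite; this is a sketch under assumed frame formats (NumPy arrays, per-pixel alpha), not MUCVR's actual protocol:

```python
import numpy as np

def compose(background, foreground, alpha):
    """Blend an edge-rendered foreground layer over a cloud-rendered
    background; alpha in [0, 1], 1 = foreground fully covers background."""
    a = alpha[..., None]            # broadcast the mask over RGB channels
    return a * foreground + (1.0 - a) * background

bg = np.zeros((4, 4, 3))            # cloud supplies the background frame
fg = np.ones((4, 4, 3))             # edge supplies the foreground layer
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                # foreground occupies the center region
frame = compose(bg, fg, mask)       # the device only decodes and blends
```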
Citations: 0
Decentralized QoS-Aware Model Inference Using Federated Split Learning for Cloud-Edge Medical Detection
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-08-01 | DOI: 10.1109/TPDS.2025.3594694
Authors: Yishan Chen; Xiangwei Zeng; Huashuai Cai; Qing Xu; Zhiquan Liu
Abstract: The application of federated learning (FL) has been widely extended to medical domains, including medical image analysis and health monitoring. With the increasing computation-power demand on edge devices, split federated learning has emerged as a promising FL architecture. In this work, a home healthcare monitoring scenario is explored. Unlike existing split federated learning studies that focus primarily on model-level optimization, this study considers system-level optimization involving latency, packet error rate, and federated training time. Specifically, a k-means algorithm is presented to select inference nodes, participating training clients, and aggregation servers according to network conditions and data quality. Furthermore, a reinforcement learning method is used to allocate computation and bandwidth resources during inference, training, and aggregation, further improving the quality of service (QoS) and training efficiency. Simulation results demonstrate that the proposed architecture achieves the target accuracy while offering enhanced QoS and reduced FL training time. (Vol. 36, No. 10, pp. 2119-2136)
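The node-selection step can be sketched with a tiny k-means over hypothetical node features (network latency, data-quality score); this illustrates the clustering idea, not the paper's exact algorithm:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means: alternate nearest-center assignment and
    center recomputation; empty clusters keep their previous center."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical nodes: [latency in ms (lower better), data quality (higher better)].
nodes = np.array([[5.0, 0.9], [6.0, 0.8], [100.0, 0.2], [110.0, 0.3]])
labels, centers = kmeans(nodes, k=2)
```

A scheduler could then pick the cluster whose center has the lowest latency and highest data quality as training participants.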
Citations: 0
Dynamic Multiresource Fair Allocation With Time Discount Utility
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-08-01 | DOI: 10.1109/TPDS.2025.3594741
Authors: Bin Deng; Weidong Li
Abstract: Multiresource allocation mechanisms have been studied in many scenarios. This article proposes a new dynamic multiresource fair allocation model with time-discount utility, in which users can arrive and depart at different time slots. We propose a new any-price-share time discount (APS-TD) mechanism for this model, which accounts for users' time-discount utility while maintaining desirable properties. We prove that the APS-TD mechanism satisfies cumulative incentive sharing (CSI), i.e., the cumulative utility of each user is not lower than the cumulative utility generated by evenly allocating the available resources in each time slot; cumulative strategyproofness (CSP), i.e., users cannot increase their cumulative utility by falsely reporting their demands in any time slot; cumulative Pareto optimality (CPO), i.e., no allocation can increase the cumulative utility of one user without reducing that of another in any time slot; cumulative envy-freeness (CEF), i.e., users who arrive later do not prefer the allocations of users who arrived earlier in any time slot; time discount share fairness (TDSF), i.e., users with higher time-discount values occupy larger resource shares in each time slot unless both users' utility levels are generated by evenly allocating resources; and bottleneck fairness (BF), i.e., the allocation satisfies max-min fairness with respect to the bottleneck resources in each time slot. We run the APS-TD mechanism on Alibaba trace-driven data to demonstrate its performance enhancement over existing mechanism extensions. The results show that the APS-TD mechanism is superior to hybrid multiresource fairness (H-MRF) and stateful dominant resource fairness (SDRF) in many respects. (Vol. 36, No. 10, pp. 2089-2103)
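The time-discount utility at the heart of the model can be illustrated numerically; the notation is assumed here: a share s_t received at slot t contributes s_t * delta**t, with delta the user's discount factor, and the CSI property compares such sums against the even-split baseline:

```python
def cumulative_utility(shares, delta):
    """Discounted cumulative utility: later slots are worth less,
    scaled by delta**t for delta in (0, 1]."""
    return sum(s * delta ** t for t, s in enumerate(shares))

# Two users sharing one resource over three slots, discount factor 0.9.
even = cumulative_utility([0.5, 0.5, 0.5], 0.9)   # even-split baseline
apsd = cumulative_utility([0.7, 0.5, 0.4], 0.9)   # a hypothetical APS-TD outcome
```

CSI requires the mechanism's cumulative utility (here `apsd`) to be at least the even-split value (`even`) for every user.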
Citations: 0
Parallelization of Network Dynamics Computations in Heterogeneous Distributed Environment
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-07-28 | DOI: 10.1109/TPDS.2025.3593154
Authors: Oleksandr Sudakov; Volodymyr Maistrenko
Abstract: This paper addresses the problem of parallelizing computations to study nonlinear dynamics in large networks of non-locally coupled oscillators using heterogeneous computing resources. The proposed approach can be applied to a variety of nonlinear dynamics models, with runtime specification of parameters and network topologies. Parallelizing the solution of equations for different network elements is performed transparently and, in contrast to available tools, requires no parallel programming from end users. The runtime scheduler takes the performance of computing and communication resources into account to reduce downtime and achieve quasi-optimal parallelization speed-up. The proposed approach was implemented, and its efficiency is demonstrated by numerous applications: simulating large dynamical networks with 10^3-10^8 elements described by Hodgkin-Huxley, FitzHugh-Nagumo, and Kuramoto models; investigating pathological synchronization during Parkinson's disease; analyzing multi-stability; and studying chimera and solitary states in 3D networks. All of these computations may be performed using symmetric multiprocessors, graphics processing units, and a network of workstations within the same run, and near-linear speed-up was demonstrated for large networks. The approach is promising for extension to new hardware such as edge-computing devices. (Vol. 36, No. 10, pp. 2030-2044)
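As a concrete instance of the models named above, a minimal Kuramoto integration step (Euler, global coupling; all parameters illustrative) shows the per-element update that such a framework distributes across compute resources:

```python
import numpy as np

def kuramoto_step(theta, omega, K, dt):
    """One Euler step of the globally coupled Kuramoto model:
    d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i).
    The per-oscillator update is the unit of work that gets parallelized."""
    coupling = np.sin(theta[None, :] - theta[:, None]).mean(axis=1)
    return theta + dt * (omega + K * coupling)

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, 100)   # initial phases
omega = rng.normal(0.0, 0.1, 100)            # natural frequencies
for _ in range(1000):
    theta = kuramoto_step(theta, omega, K=2.0, dt=0.05)

# Order parameter r in [0, 1]: r -> 1 as the population synchronizes.
order = abs(np.exp(1j * theta).mean())
```

With coupling K well above the critical value for this frequency spread, the population locks and the order parameter approaches 1.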
Citations: 0
Performance Portability Assessment in Gaia
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-07-22 | DOI: 10.1109/TPDS.2025.3591452
Authors: Giulio Malenza; Valentina Cesare; Marco Edoardo Santimaria; Robert Birke; Alberto Vecchiato; Ugo Becciani; Marco Aldinucci
Abstract: Modern scientific experiments produce ever-increasing amounts of data, soon requiring ExaFLOP computing capacities for analysis. Reaching such performance requires purpose-built supercomputers with O(10^3) nodes, each hosting multicore CPUs and multiple GPUs, and applications designed to exploit this hardware optimally. Given that each supercomputer is generally a one-off project, the need for computing frameworks portable across diverse CPU and GPU architectures without performance loss is increasingly compelling. We investigate the performance portability (PP) of a real-world application: the solver module of the AVU-GSR pipeline for the ESA Gaia mission. This code finds the astrometric parameters of ~10^8 stars in the Milky Way using the LSQR iterative algorithm. LSQR is widely used to solve linear systems of equations across a wide range of high-performance computing applications, elevating the study beyond its astrophysical relevance. The code is memory-bound, with six main compute kernels implementing sparse matrix-by-vector products. We optimize the previous CUDA implementation and port the code to six further GPU-acceleration frameworks: C++ PSTL, SYCL, OpenMP, HIP, KOKKOS, and OpenACC. We evaluate each framework's performance portability across multiple GPUs (NVIDIA and AMD) and problem sizes in terms of application and architectural efficiency; architectural efficiency is estimated through the roofline model of the six most computationally expensive GPU kernels. Our results show that C++ library-based (C++ PSTL and KOKKOS), pragma-based (OpenMP and OpenACC), and language-specific (CUDA, HIP, and SYCL) frameworks achieve increasingly better performance portability across the supported platforms, with larger problem sizes providing better PP scores due to higher GPU occupancies. (Vol. 36, No. 10, pp. 2045-2057; open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11090032)
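The hot loop behind LSQR is a sparse matrix-by-vector product; a CSR matvec in plain NumPy (a stand-in for the GPU kernels, not the Gaia code itself) shows the memory-bound access pattern each framework must map efficiently:

```python
import numpy as np

def csr_matvec(indptr, indices, data, x):
    """y = A @ x for A in CSR form: indptr delimits each row's slice of
    (indices, data). Per-row work is a short dot product over gathered x
    entries, which is why the kernel is bandwidth- rather than FLOP-bound."""
    data = np.asarray(data, dtype=float)
    x = np.asarray(x, dtype=float)
    indices = np.asarray(indices)
    y = np.empty(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]
y = csr_matvec([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The indirect gather `x[indices[lo:hi]]` is the irregular memory access that dominates performance on every platform studied.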
Citations: 0
Task Scheduling in Geo-Distributed Computing: A Survey
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-07-21 | DOI: 10.1109/TPDS.2025.3591010
Authors: Yujian Wu; Shanjiang Tang; Ce Yu; Bin Yang; Chao Sun; Jian Xiao; Hutong Wu; Jinghua Feng
Abstract: Geo-distributed computing, a paradigm that assigns computational tasks to globally distributed nodes, has emerged as a promising approach in cloud computing, edge computing, cloud-edge computing, and supercomputing (SC). It enables low-latency services, ensures data locality, and handles large-scale applications. As global computing capacity and task demands increase rapidly, scheduling tasks for efficient execution in geo-distributed computing systems has become an increasingly critical research challenge, arising from the inherent characteristics of geographic distribution: heterogeneous network conditions, region-specific resource pricing, and varying computational capabilities across locations. Researchers have developed diverse task scheduling methods tailored to geo-distributed scenarios, aiming at objectives such as performance enhancement, fairness assurance, and fault-tolerance improvement. This survey provides a comprehensive and systematic review of task scheduling techniques across four major distributed computing environments, with an in-depth analysis based on their core scheduling objectives. Through this analysis, we identify key research challenges and outline promising directions for advancing task scheduling in geo-distributed computing. (Vol. 36, No. 10, pp. 2073-2088)
Citations: 0
Accelerating Half-Precision Seismic Simulation on Neural Processing Unit
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-07-15 | DOI: 10.1109/TPDS.2025.3584773
Authors: Yinuo Wang; Zeyu Song; Wubing Wan; Xinpeng Zhao; Lin Gan; Ping Gao; Wenqiang Wang; Zhenguo Zhang; Haohuan Fu; Wei Xue; Guangwen Yang
Abstract: Due to its superiority in handling irregular regions of interest, the curvilinear-grid finite-difference method (CGFDM) has become widely used in seismic simulation for earthquake hazard evaluation and for understanding earthquake physics. This paper proposes a novel approach that optimizes a CGFDM solver on the Ascend, a cutting-edge Neural Processing Unit (NPU), using half-precision storage and mixed-precision arithmetic. The approach increases data throughput and computing efficiency, enabling more effective seismic modeling. Furthermore, we propose an efficient matrix-unit-enabled 3D difference algorithm that employs the NPU's matrix unit to accelerate the computation. By fully exploiting the matrix unit and wide SIMD lanes, our solver on Ascend achieves a 4.19× speedup over a parallel solver on two AMD CPUs and has successfully simulated the real-world Wenchuan earthquake. To the best of our knowledge, we are the first to conduct seismic simulations on an NPU. (Vol. 36, No. 10, pp. 1998-2013)
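The storage/compute precision split can be sketched on a 1D wave stencil; this is a simplification under assumed discretization (the paper's solver is a 3D curvilinear-grid scheme):

```python
import numpy as np

def fd_step(u_prev, u_curr, c2dt2_over_dx2):
    """One leapfrog step of the 1D wave equation with fp16 storage and
    fp32 arithmetic: fields are promoted for the stencil computation and
    demoted back to half precision for storage."""
    u32 = u_curr.astype(np.float32)          # promote fp16 -> fp32
    p32 = u_prev.astype(np.float32)
    lap = u32[:-2] - 2.0 * u32[1:-1] + u32[2:]          # 3-point Laplacian
    nxt = 2.0 * u32[1:-1] - p32[1:-1] + c2dt2_over_dx2 * lap
    out = u_curr.copy()
    out[1:-1] = nxt.astype(np.float16)       # demote back to fp16 storage
    return out

u_prev = np.zeros(32, dtype=np.float16)
u_curr = np.zeros(32, dtype=np.float16)
u_curr[16] = 1.0                             # point impulse
u_next = fd_step(u_prev, u_curr, np.float32(0.25))
```

Half-precision storage halves memory traffic on a memory-bound stencil, while fp32 arithmetic limits the rounding error accumulated per step.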
Citations: 0
High Performance OpenCL-Based GEMM Kernel Auto-Tuned by Bayesian Optimization
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-07-10 | DOI: 10.1109/TPDS.2025.3587673
Authors: Shengle Lin; Guoqing Xiao; Haotian Wang; Wangdong Yang; Kenli Li; Keqin Li
Abstract: OpenCL has become the favored framework for emerging heterogeneous devices and FPGAs, owing to its versatility and portability. However, OpenCL-based math libraries still face challenges in fully leveraging device performance. When deploying high-performance arithmetic applications on these devices, the most important hot function is General Matrix-Matrix Multiplication (GEMM). This study presents a meticulously optimized OpenCL GEMM kernel with two key improvements: (1) a three-level double-buffer pipeline that efficiently overlaps data fetching with floating-point computation, and (2) a fine-grained private-memory prefetching strategy that increases device occupancy by optimizing register-unit utilization. Furthermore, this work presents a Bayesian Optimization (BO) tuner for kernel auto-tuning. Experimental results demonstrate considerable optimization improvement and performance advantages on diverse OpenCL devices. Additionally, the BO tuner demonstrates superior efficiency and robustness, outperforming contemporary tuning methods. (Vol. 36, No. 9, pp. 1985-1997)
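A kernel auto-tuning loop has roughly the following shape; here plain random search stands in for the paper's Bayesian-optimization surrogate, and both the parameter space and the cost model are hypothetical:

```python
import random

# Hypothetical tuning space for a GEMM kernel (names are illustrative).
SPACE = {"tile_m": [32, 64, 128], "tile_n": [32, 64, 128], "unroll": [1, 2, 4]}

def measure(cfg):
    """Placeholder cost model standing in for compiling and timing the
    kernel with this configuration on a real device."""
    return abs(cfg["tile_m"] - 64) + abs(cfg["tile_n"] - 64) + cfg["unroll"]

def tune(trials=50, seed=0):
    """Sample configurations and keep the fastest. A BO tuner replaces the
    random sampler with a surrogate model plus an acquisition function,
    spending each trial where predicted improvement is highest."""
    rng = random.Random(seed)
    best_cfg, best_t = None, float("inf")
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        t = measure(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

cfg, t = tune()
```

The advantage the paper claims for BO is precisely that it needs far fewer `measure` calls than blind sampling to find a near-optimal configuration.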
Citations: 0
m²LLM: A Multi-Dimensional Optimization Framework for LLM Inference on Mobile Devices
IF 6.0 | CAS Tier 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2025-07-10 | DOI: 10.1109/TPDS.2025.3587445
Authors: Kaiyuan Liu; Xiaobo Zhou; Li Li
Abstract: Large Language Models (LLMs) are reshaping mobile AI. Directly deploying LLMs on mobile devices is an emerging paradigm that can broadly support different mobile applications while preserving data privacy. However, intensive memory footprints, long inference latency, and high energy consumption severely bottleneck on-device LLM inference in real-world scenarios. In response, this work introduces m²LLM, a framework that performs joint optimization across multiple dimensions for on-device LLM inference, striking a balance among performance, real-time responsiveness, and energy efficiency. Specifically, m²LLM features four core components: (1) hardware-aware model customization, (2) an elastic chunk-wise pipeline, (3) latency-guided prompt compression, and (4) layer-wise resource scheduling. These components interact to guide the inference process along three dimensions. At the model level, m²LLM designs an elastic chunk-wise pipeline to expand device memory and customizes the model according to the hardware configuration, maximizing performance within the memory budget. At the prompt level, facing stochastic inputs, m²LLM judiciously compresses prompts to guarantee that the first token is generated in time while preserving semantic information. At the system level, a layer-wise resource scheduler completes the token-generation process with minimal energy consumption while guaranteeing real-time behavior in highly dynamic mobile environments. m²LLM is evaluated on off-the-shelf smartphones with representative models and datasets. Compared to baseline methods, m²LLM delivers 2.99-13.5× TTFT acceleration and 2.28-24.3× energy savings, with only a minimal model-performance loss of 2%-7%. (Vol. 36, No. 10, pp. 2014-2029)
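Latency-guided prompt compression can be sketched as a token-budget rule; the linear prefill-latency model and the importance scores below are assumptions for illustration, not m²LLM's actual method:

```python
def compress(tokens, scores, ms_per_token, deadline_ms):
    """Keep at most as many tokens as the first-token deadline allows
    (assuming prefill time grows linearly with prompt length), dropping
    the lowest-importance tokens while preserving original order."""
    budget = int(deadline_ms // ms_per_token)     # tokens that fit the deadline
    if len(tokens) <= budget:
        return tokens                             # already within budget
    keep = sorted(range(len(tokens)),
                  key=lambda i: scores[i], reverse=True)[:budget]
    return [tokens[i] for i in sorted(keep)]      # restore reading order

# 10 ms/token prefill, 20 ms deadline -> only the 2 most important tokens survive.
short = compress(["a", "b", "c", "d"], [0.1, 0.9, 0.8, 0.2], 10, 20)
```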
Citations: 0