Chasing Common Knowledge: Joint Large Model Selection and Pulling in MEC With Parameter Sharing

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-01-08 DOI:10.1109/TPDS.2025.3527649

Lizhen Zhou;Zichuan Xu;Qiufen Xia;Zhou Xu;Wenhao Ren;Wenbo Qi;Jinjing Ma;Song Yan;Yuan Yang

{"title":"Chasing Common Knowledge: Joint Large Model Selection and Pulling in MEC With Parameter Sharing","authors":"Lizhen Zhou;Zichuan Xu;Qiufen Xia;Zhou Xu;Wenhao Ren;Wenbo Qi;Jinjing Ma;Song Yan;Yuan Yang","doi":"10.1109/TPDS.2025.3527649","DOIUrl":null,"url":null,"abstract":"Pretrained Foundation Models (PFMs) are regarded as a promising accelerator for the development of various Artificial Intelligence (AI) applications, and have recently been widely fine-tuned to satisfy users’ personalized inference demands. As many users are attracted to PFM-based AI applications, remote data centers are increasingly unable to solely bear the enormous computational demands and meet the delay requirements of inference requests. Mobile edge computing (MEC) offers a viable solution for delivering low-latency inference services by pulling fine-tuned PFMs from the remote data center to cloudlets in the proximity of users. However, a fine-tuned PFM typically comprises billions of model parameters, which are highly resource-intensive, time-consuming, and cost-prohibitive to execute at the edge. To address this, we investigate a novel joint large model selection and pulling problem in MEC networks. The novelty of our study lies in exploring parameter sharing among fine-tuned PFMs based on their common knowledge. Specifically, we first formulate a Non-Linear Integer Programming (NLIP) for the problem to minimize the total delay of implementing all inference requests. We then transform the NLIP into an equivalent Integer Linear Program (ILP) that is much simpler to solve. We further propose a randomized algorithm with a provable approximation ratio for the problem. We also consider the online version of the problem with uncertain request demand, and develop an online learning algorithm with a bounded regret. The crux of the online algorithm is the adoption of the multi-armed bandit technique with restricted context for dynamic admissions of inference requests. We finally conduct extensive experiments based on real datasets. Experimental results demonstrate that our algorithms reduce at least 38% in total delays and average costs, while achieving a 5% improvement in average accuracies.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"437-454"},"PeriodicalIF":5.6000,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10834568/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Pretrained Foundation Models (PFMs) are regarded as a promising accelerator for the development of various Artificial Intelligence (AI) applications, and have recently been widely fine-tuned to satisfy users’ personalized inference demands. As many users are attracted to PFM-based AI applications, remote data centers are increasingly unable to solely bear the enormous computational demands and meet the delay requirements of inference requests. Mobile edge computing (MEC) offers a viable solution for delivering low-latency inference services by pulling fine-tuned PFMs from the remote data center to cloudlets in the proximity of users. However, a fine-tuned PFM typically comprises billions of model parameters, which are highly resource-intensive, time-consuming, and cost-prohibitive to execute at the edge. To address this, we investigate a novel joint large model selection and pulling problem in MEC networks. The novelty of our study lies in exploring parameter sharing among fine-tuned PFMs based on their common knowledge. Specifically, we first formulate a Non-Linear Integer Programming (NLIP) for the problem to minimize the total delay of implementing all inference requests. We then transform the NLIP into an equivalent Integer Linear Program (ILP) that is much simpler to solve. We further propose a randomized algorithm with a provable approximation ratio for the problem. We also consider the online version of the problem with uncertain request demand, and develop an online learning algorithm with a bounded regret. The crux of the online algorithm is the adoption of the multi-armed bandit technique with restricted context for dynamic admissions of inference requests. We finally conduct extensive experiments based on real datasets. Experimental results demonstrate that our algorithms reduce at least 38% in total delays and average costs, while achieving a 5% improvement in average accuracies.

查看原文本刊更多论文

追求共同知识：参数共享的MEC联合大模型选择与抽取

预训练基础模型（PFMs）被认为是各种人工智能（AI）应用开发的一个有前途的加速器，最近被广泛微调以满足用户的个性化推理需求。随着越来越多的用户被基于pfm的AI应用所吸引，远程数据中心越来越无法单独承担巨大的计算需求和满足推理请求的延迟要求。移动边缘计算（MEC）通过将微调的pfm从远程数据中心拉到用户附近的云，为交付低延迟推理服务提供了可行的解决方案。然而，一个微调的PFM通常包含数十亿个模型参数，这是高度资源密集型的，耗时的，并且在边缘执行成本过高。为了解决这个问题，我们研究了MEC网络中一个新的联合大模型选择和提取问题。本研究的新颖之处在于基于它们的共同知识来探索微调pfm之间的参数共享。具体来说，我们首先为该问题制定了非线性整数规划（NLIP），以最小化实现所有推理请求的总延迟。然后，我们将NLIP转换成一个等效的整数线性规划（ILP），求解起来简单得多。我们进一步提出了一个具有可证明近似比的随机化算法。我们还考虑了请求需求不确定问题的在线版本，并开发了一种具有有限遗憾的在线学习算法。该在线算法的核心是采用限制上下文的多臂强盗技术对推理请求进行动态接纳。最后，我们基于真实的数据集进行了广泛的实验。实验结果表明，我们的算法在总延迟和平均成本上降低了至少38%，而在平均精度上提高了5%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Parallel and Distributed Systems 工程技术-工程：电子与电气

CiteScore

11.00

自引率

9.40%

发文量

281

审稿时长

5.6 months

期刊介绍： IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to: a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing. b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems. c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation. d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.