ELMS: Elasticized Large Language Models On Mobile Devices

Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
{"title":"ELMS:移动设备上的弹性大型语言模型","authors":"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu","doi":"arxiv-2409.09071","DOIUrl":null,"url":null,"abstract":"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\nenabling applications such as UI automation while addressing privacy concerns.\nCurrently, the standard approach involves deploying a single, robust LLM as a\nuniversal solution for various applications, often referred to as\nLLM-as-a-Service (LLMaaS). However, this approach faces a significant system\nchallenge: existing LLMs lack the flexibility to accommodate the diverse\nService-Level Objectives (SLOs) regarding inference latency across different\napplications. To address this issue, we introduce ELMS, an on-device LLM\nservice designed to provide elasticity in both the model and prompt dimensions\nof an LLMaaS. This system includes: A one-time neuron reordering technique,\nwhich utilizes the inherent permutation consistency within transformer models\nto create high-quality, elastic sub-models with minimal runtime switching\ncosts. A dual-head compact language model, which efficiently refines prompts\nand coordinates the elastic adaptation between the model and the prompt. We\nhave implemented this elastic on-device LLM service on several off-the-shelf\n(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\ndatasets and synthesized end-to-end traces. Across a range of SLOs, ELMS\nsurpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\non average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\ncomparable memory usage, and fewer than 100 offline GPU hours.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ELMS: Elasticized Large Language Models On Mobile Devices\",\"authors\":\"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu\",\"doi\":\"arxiv-2409.09071\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\\nenabling applications such as UI automation while addressing privacy concerns.\\nCurrently, the standard approach involves deploying a single, robust LLM as a\\nuniversal solution for various applications, often referred to as\\nLLM-as-a-Service (LLMaaS). However, this approach faces a significant system\\nchallenge: existing LLMs lack the flexibility to accommodate the diverse\\nService-Level Objectives (SLOs) regarding inference latency across different\\napplications. To address this issue, we introduce ELMS, an on-device LLM\\nservice designed to provide elasticity in both the model and prompt dimensions\\nof an LLMaaS. This system includes: A one-time neuron reordering technique,\\nwhich utilizes the inherent permutation consistency within transformer models\\nto create high-quality, elastic sub-models with minimal runtime switching\\ncosts. A dual-head compact language model, which efficiently refines prompts\\nand coordinates the elastic adaptation between the model and the prompt. We\\nhave implemented this elastic on-device LLM service on several off-the-shelf\\n(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\\ndatasets and synthesized end-to-end traces. 
Across a range of SLOs, ELMS\\nsurpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\\non average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\\ncomparable memory usage, and fewer than 100 offline GPU hours.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"31 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09071\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling applications such as UI automation while addressing privacy concerns. Currently, the standard approach involves deploying a single, robust LLM as a universal solution for various applications, often referred to as LLM-as-a-Service (LLMaaS). However, this approach faces a significant system challenge: existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) regarding inference latency across different applications. To address this issue, we introduce ELMS, an on-device LLM service designed to provide elasticity in both the model and prompt dimensions of an LLMaaS. The system comprises two techniques: a one-time neuron reordering technique, which exploits the inherent permutation consistency within transformer models to create high-quality, elastic sub-models with minimal runtime switching costs; and a dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt. We have implemented this elastic on-device LLM service on several commercial off-the-shelf (COTS) smartphones and evaluated ELMS using both standalone NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS surpasses four strong baselines by up to 16.83% and by 11.04% on average in absolute accuracy, with less than 1% Time-To-First-Token (TTFT) switching overhead, comparable memory usage, and fewer than 100 offline GPU hours.
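To make the abstract's central mechanism concrete, the sketch below (Python/NumPy) illustrates why permutation consistency enables cheap elastic sub-models: permuting the hidden neurons of an FFN block's up- and down-projections identically leaves the block's output unchanged, so a one-time offline reordering by importance lets the runtime obtain a smaller sub-model by simply slicing a weight prefix. This is a minimal sketch under stated assumptions, not ELMS's actual implementation; the importance proxy (output-weight norms), the latency profile, and the pick_width helper are all illustrative inventions.

```python
# Illustrative sketch (not the paper's code): permutation consistency in a
# transformer FFN block, and prefix slicing to obtain elastic sub-models.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

W_in = rng.normal(size=(d_ff, d_model))    # up-projection
b_in = rng.normal(size=d_ff)
W_out = rng.normal(size=(d_model, d_ff))   # down-projection

def ffn(x, W_in, b_in, W_out, k=None):
    """FFN forward pass; if k is given, keep only the first k hidden neurons."""
    if k is not None:
        W_in, b_in, W_out = W_in[:k], b_in[:k], W_out[:, :k]
    return W_out @ np.maximum(W_in @ x + b_in, 0.0)

x = rng.normal(size=d_model)

# One-time offline reordering: rank neurons by an (assumed) importance proxy
# and permute both projections consistently, most important neurons first.
order = np.argsort(-np.linalg.norm(W_out, axis=0))
W_in_r, b_in_r, W_out_r = W_in[order], b_in[order], W_out[:, order]

# Permutation consistency: the fully reordered block computes the same output,
# because the down-projection sums over hidden neurons in any order.
assert np.allclose(ffn(x, W_in, b_in, W_out), ffn(x, W_in_r, b_in_r, W_out_r))

# An elastic sub-model is now just a prefix slice of the reordered weights,
# so switching widths at runtime costs almost nothing.
y_half = ffn(x, W_in_r, b_in_r, W_out_r, k=d_ff // 2)

# Hypothetical SLO-driven selection: choose the widest sub-model whose
# profiled prefill latency still meets the application's TTFT budget.
latency_ms = {8: 40, 16: 70, 24: 100, 32: 130}   # assumed offline profile

def pick_width(ttft_budget_ms):
    feasible = [k for k, t in latency_ms.items() if t <= ttft_budget_ms]
    return max(feasible) if feasible else min(latency_ms)

print(pick_width(80))   # -> 16
```

In a real deployment the reordering would presumably be applied once, offline, to every transformer block, and the prompt-side elasticity (the dual-head compact model) would shrink the input in tandem whenever even the smallest width cannot meet the SLO.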