EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding

IF 7.7 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Mobile Computing Pub Date : 2024-12-23 DOI:10.1109/TMC.2024.3513457

Daliang Xu;Wangsong Yin;Hao Zhang;Xin Jin;Ying Zhang;Shiyun Wei;Mengwei Xu;Xuanzhe Liu

Vol. 24, No. 4, pp. 3256-3273. https://ieeexplore.ieee.org/document/10812936/


Abstract

Generative tasks, such as text generation and question answering, are essential for mobile applications. Because these tasks are inherently privacy-sensitive, they should run on the device itself. Today they rely heavily on Large Language Models (LLMs), but scarce device memory severely limits the size of the models that can be deployed. We present EdgeLLM, an efficient on-device LLM inference system for models whose sizes exceed the device's memory capacity. EdgeLLM is built atop speculative decoding, which delegates most tokens to a smaller, memory-resident draft LLM. EdgeLLM integrates three novel techniques: (1) Instead of generating a token tree of fixed width and depth, EdgeLLM uses compute-efficient branch navigation and verification: it paces the progress of each branch according to its probability of being accepted, which avoids wasting compute on wrong branches, and it verifies all branches efficiently at once. (2) It uses a self-adaptive fallback strategy that promptly initiates verification when the smaller LLM generates an incorrect token. (3) To avoid blocking generation, EdgeLLM keeps generating tokens speculatively while the large LLM performs verification, using a compute-IO pipeline. In extensive experiments, EdgeLLM generates tokens up to 9.3× faster than existing engines.
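The draft-then-verify loop behind speculative decoding can be illustrated with a minimal Python sketch. It is not EdgeLLM's implementation: the abstract does not specify the algorithms, so everything here is an assumption made for illustration. The models are toy functions that return a token and a confidence, decoding is greedy, the fallback rule is a simple confidence threshold (`fallback_threshold`, a hypothetical parameter), and neither the token tree nor the compute-IO pipeline is modeled.

```python
from typing import Callable, List, Tuple

Token = int
# Assumption: a "model" is any function mapping a token prefix to
# (greedy next token, confidence in that token).
Model = Callable[[List[Token]], Tuple[Token, float]]


def speculative_decode(draft: Model, target: Model, prompt: List[Token],
                       max_new_tokens: int, fallback_threshold: float = 0.5,
                       max_draft_len: int = 8) -> Tuple[List[Token], int]:
    """Return (tokens, number of verification rounds by the target model)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # Draft phase: the small model proposes tokens until it becomes
        # unsure (a simplified stand-in for a fallback rule) or hits the cap.
        drafted: List[Token] = []
        while len(drafted) < max_draft_len:
            tok, conf = draft(out + drafted)
            drafted.append(tok)
            if conf < fallback_threshold:
                break

        # Verify phase: a real engine checks all drafted positions in one
        # batched forward pass; this sketch loops but counts one round.
        rounds += 1
        accepted: List[Token] = []
        for tok in drafted:
            expected, _ = target(out + accepted)
            if expected != tok:
                accepted.append(expected)  # target's correction ends the round
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target adds one more token.
            bonus, _ = target(out + accepted)
            accepted.append(bonus)
        out.extend(accepted)
    return out[:len(prompt) + max_new_tokens], rounds


def target_model(prefix: List[Token]) -> Tuple[Token, float]:
    # Toy "large" model: counts upward modulo 10.
    return (prefix[-1] + 1) % 10, 1.0


def draft_model(prefix: List[Token]) -> Tuple[Token, float]:
    # Toy "small" model: agrees with the target except after token 4,
    # where it guesses wrongly and reports low confidence.
    if prefix[-1] == 4:
        return 0, 0.3
    return (prefix[-1] + 1) % 10, 0.9


if __name__ == "__main__":
    tokens, rounds = speculative_decode(draft_model, target_model, [0], 12)
    print(tokens, rounds)
```

With these toy models the output matches plain greedy decoding with the target model, yet the target is consulted in only 2 verification rounds for 12 new tokens. Greedy verification keeps the output identical to the target's own; how much work is saved depends on how often the draft model agrees with the target.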


Source journal

IEEE Transactions on Mobile Computing (Engineering & Technology: Telecommunications)

CiteScore

12.90

Self-citation rate

2.50%

Articles published

403

Review time

6.6 months

About the journal: IEEE Transactions on Mobile Computing addresses key technical issues related to various aspects of mobile computing. This includes (a) architectures, (b) support services, (c) algorithm/protocol design and analysis, (d) mobile environments, (e) mobile communication systems, (f) applications, and (g) emerging technologies. Topics of interest span a wide range, covering aspects like mobile networks and hosts, mobility management, multimedia, operating system support, power management, online and mobile environments, security, scalability, reliability, and emerging technologies such as wearable computers, body area networks, and wireless sensor networks. The journal serves as a comprehensive platform for advancements in mobile computing research.