Daliang Xu;Wangsong Yin;Hao Zhang;Xin Jin;Ying Zhang;Shiyun Wei;Mengwei Xu;Xuanzhe Liu
{"title":"EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding","authors":"Daliang Xu;Wangsong Yin;Hao Zhang;Xin Jin;Ying Zhang;Shiyun Wei;Mengwei Xu;Xuanzhe Liu","doi":"10.1109/TMC.2024.3513457","DOIUrl":null,"url":null,"abstract":"Generative tasks, such as text generation and question answering, are essential for mobile applications. Given their inherent privacy sensitivity, executing them on devices is demanded. Nowadays, the execution of these generative tasks heavily relies on the Large Language Models (LLMs). However, the scarce device memory severely hinders the scalability of these models. We present <monospace>EdgeLLM</monospace>, an efficient on-device LLM inference system for models whose sizes exceed the device's memory capacity. <monospace>EdgeLLM</monospace> is built atop speculative decoding, which delegates most tokens to a smaller, memory-resident (draft) LLM. <monospace>EdgeLLM</monospace> integrates three novel techniques: (1) Instead of generating a fixed width and depth token tree, <monospace>EdgeLLM</monospace> proposes compute-efficient branch navigation and verification to pace the progress of different branches according to their accepted probability to prevent the wasteful allocation of computing resources to the wrong branch and to verify them all at once efficiently. (2) It uses a self-adaptive fallback strategy that promptly initiates the verification process when the smaller LLM generates an incorrect token. (3) To not block the generation, <monospace>EdgeLLM</monospace> proposes speculatively generating tokens during large LLM verification with the compute-IO pipeline. Through extensive experiments, <monospace>EdgeLLM</monospace> exhibits impressive token generation speed which is up to 9.3× faster than existing engines.","PeriodicalId":50389,"journal":{"name":"IEEE Transactions on Mobile Computing","volume":"24 4","pages":"3256-3273"},"PeriodicalIF":7.7000,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Mobile Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10812936/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Generative tasks, such as text generation and question answering, are essential for mobile applications. Given their inherent privacy sensitivity, executing them on devices is demanded. Nowadays, the execution of these generative tasks heavily relies on the Large Language Models (LLMs). However, the scarce device memory severely hinders the scalability of these models. We present EdgeLLM, an efficient on-device LLM inference system for models whose sizes exceed the device's memory capacity. EdgeLLM is built atop speculative decoding, which delegates most tokens to a smaller, memory-resident (draft) LLM. EdgeLLM integrates three novel techniques: (1) Instead of generating a fixed width and depth token tree, EdgeLLM proposes compute-efficient branch navigation and verification to pace the progress of different branches according to their accepted probability to prevent the wasteful allocation of computing resources to the wrong branch and to verify them all at once efficiently. (2) It uses a self-adaptive fallback strategy that promptly initiates the verification process when the smaller LLM generates an incorrect token. (3) To not block the generation, EdgeLLM proposes speculatively generating tokens during large LLM verification with the compute-IO pipeline. Through extensive experiments, EdgeLLM exhibits impressive token generation speed which is up to 9.3× faster than existing engines.
期刊介绍:
IEEE Transactions on Mobile Computing addresses key technical issues related to various aspects of mobile computing. This includes (a) architectures, (b) support services, (c) algorithm/protocol design and analysis, (d) mobile environments, (e) mobile communication systems, (f) applications, and (g) emerging technologies. Topics of interest span a wide range, covering aspects like mobile networks and hosts, mobility management, multimedia, operating system support, power management, online and mobile environments, security, scalability, reliability, and emerging technologies such as wearable computers, body area networks, and wireless sensor networks. The journal serves as a comprehensive platform for advancements in mobile computing research.