TPI-LLM: Serving 70B-Scale LLMs Efficiently on Low-Resource Mobile Devices

IF 5.8 | Zone 2, Computer Science | Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Services Computing Pub Date : 2025-08-08 DOI:10.1109/TSC.2025.3596892

Zonghang Li;Wenjiao Feng;Mohsen Guizani;Hongfang Yu

IEEE Transactions on Services Computing, vol. 18, no. 5, pp. 3321-3333. Journal article. https://ieeexplore.ieee.org/document/11119787/

Citations: 0

Abstract

LLM serving is shifting from cloud to edge due to privacy concerns over user interaction data. However, mobile devices struggle with very limited computing power and memory, requiring collaboration among multiple devices to run LLM apps. The mainstream solution, pipeline parallelism, is inefficient for such cases because mobile devices typically run only one inference task at a time. This article argues that tensor parallelism, despite its high communication cost, can better fit such scenarios. We introduce TPI-LLM, a compute and memory-efficient tensor parallel inference system designed to run 70B-scale LLMs on low-resource mobile devices. It keeps sensitive raw data local on users’ devices and employs a sliding window memory scheduler to dynamically manage layer weights. It overlaps disk I/O with computation and communication, enabling efficient operation of large models on memory-limited devices. Extensive experiments show that TPI-LLM reduces token latency by 80%–90% compared to Transformers, Accelerate, and Galaxy. It also cuts the peak memory footprint by 90%, requiring just 3.1 GiB of memory for 70B-scale models.
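The abstract describes a sliding window memory scheduler that overlaps disk I/O with computation. The following is a minimal sketch of that general idea, not the authors' implementation: a background thread prefetches layer weights while the main thread computes, and at most `window` layers stay resident at once. The names `SlidingWindowScheduler`, `load_fn`, and `run_forward` are hypothetical, the "weights" are toy lists, `time.sleep` stands in for disk reads and layer computation, and inter-device communication is left out.

```python
import threading
import time


class SlidingWindowScheduler:
    """Keeps at most `window` layers resident; prefetches the rest in order."""

    def __init__(self, num_layers, window, load_fn):
        self.num_layers = num_layers
        self.window = window
        self.load_fn = load_fn      # hypothetical: reads one layer's weights from disk
        self.resident = {}          # layer index -> weights currently in memory
        self.peak_resident = 0      # largest number of layers held at once
        self.cond = threading.Condition()
        threading.Thread(target=self._prefetch, daemon=True).start()

    def _prefetch(self):
        for idx in range(self.num_layers):
            with self.cond:
                # Block until the window has a free slot.
                self.cond.wait_for(lambda: len(self.resident) < self.window)
            # Load outside the lock so disk I/O overlaps with computation.
            weights = self.load_fn(idx)
            with self.cond:
                self.resident[idx] = weights
                self.peak_resident = max(self.peak_resident, len(self.resident))
                self.cond.notify_all()

    def acquire(self, idx):
        """Block until layer `idx` is resident, then return its weights."""
        with self.cond:
            self.cond.wait_for(lambda: idx in self.resident)
            return self.resident[idx]

    def release(self, idx):
        """Drop layer `idx` from memory, freeing a slot for the prefetcher."""
        with self.cond:
            del self.resident[idx]
            self.cond.notify_all()


def run_forward(num_layers=8, window=2):
    def load_fn(idx):
        time.sleep(0.01)            # stand-in for a disk read
        return [float(idx)] * 4     # toy weights

    sched = SlidingWindowScheduler(num_layers, window, load_fn)
    x = 0.0
    for idx in range(num_layers):
        w = sched.acquire(idx)
        time.sleep(0.01)            # stand-in for the layer's computation
        x += sum(w)
        sched.release(idx)
    return x, sched.peak_resident


if __name__ == "__main__":
    result, peak = run_forward()
    print(result, peak)
```

In this sketch the window size bounds peak memory, while the prefetch thread hides load time behind computation whenever a layer's compute takes at least as long as the next layer's read. It covers a single forward pass. A real system would cycle through the layers once per generated token.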


Source journal

IEEE Transactions on Services Computing (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore: 11.50
Self-citation rate: 6.20%
Articles published: 278
Review time: >12 weeks

About the journal: IEEE Transactions on Services Computing encompasses the computing and software aspects of the science and technology of services innovation research and development. It places emphasis on algorithmic, mathematical, statistical, and computational methods central to services computing. Topics covered include Service Oriented Architecture, Web Services, Business Process Integration, Solution Performance Management, and Services Operations and Management. The transactions address mathematical foundations, security, privacy, agreement, contract, discovery, negotiation, collaboration, and quality of service for web services. It also covers areas like composite web service creation, business and scientific applications, standards, utility models, business process modeling, integration, collaboration, and more in the realm of Services Computing.