{"title":"TPI-LLM: Serving 70B-Scale LLMs Efficiently on Low-Resource Mobile Devices","authors":"Zonghang Li;Wenjiao Feng;Mohsen Guizani;Hongfang Yu","doi":"10.1109/TSC.2025.3596892","DOIUrl":null,"url":null,"abstract":"LLM serving is shifting from cloud to edge due to privacy concerns over user interaction data. However, mobile devices struggle with very limited computing power and memory, requiring collaboration among multiple devices to run LLM apps. The mainstream solution, pipeline parallelism, is inefficient for such cases because mobile devices typically run only one inference task at a time. This article argues that tensor parallelism, despite its high communication cost, can better fit such scenarios. We introduce TPI-LLM, a compute and memory-efficient tensor parallel inference system designed to run 70B-scale LLMs on low-resource mobile devices. It keeps sensitive raw data local on users’ devices and employs a sliding window memory scheduler to dynamically manage layer weights. It overlaps disk I/O with computation and communication, enabling efficient operation of large models on memory-limited devices. Extensive experiments show that TPI-LLM reduces token latency by 80%–90% compared to Transformers, Accelerate, and Galaxy. It also cuts the peak memory footprint by 90%, requiring just 3.1 GiB of memory for 70B-scale models.","PeriodicalId":13255,"journal":{"name":"IEEE Transactions on Services Computing","volume":"18 5","pages":"3321-3333"},"PeriodicalIF":5.8000,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Services Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11119787/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
LLM serving is shifting from cloud to edge due to privacy concerns over user interaction data. However, mobile devices struggle with very limited computing power and memory, requiring collaboration among multiple devices to run LLM apps. The mainstream solution, pipeline parallelism, is inefficient for such cases because mobile devices typically run only one inference task at a time. This article argues that tensor parallelism, despite its high communication cost, can better fit such scenarios. We introduce TPI-LLM, a compute and memory-efficient tensor parallel inference system designed to run 70B-scale LLMs on low-resource mobile devices. It keeps sensitive raw data local on users’ devices and employs a sliding window memory scheduler to dynamically manage layer weights. It overlaps disk I/O with computation and communication, enabling efficient operation of large models on memory-limited devices. Extensive experiments show that TPI-LLM reduces token latency by 80%–90% compared to Transformers, Accelerate, and Galaxy. It also cuts the peak memory footprint by 90%, requiring just 3.1 GiB of memory for 70B-scale models.
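To illustrate the scheduling idea described in the abstract, the sketch below shows a sliding-window weight scheduler in Python: only a small window of layers stays resident in memory, and the next layer's weights are prefetched from disk on a background thread while the current layer computes, so disk I/O overlaps with computation. The class and helper names (SlidingWindowScheduler, load_layer_weights, layer_forward) and the single-worker prefetch design are illustrative assumptions, not TPI-LLM's actual implementation.

```python
# Minimal sketch (not TPI-LLM's actual code) of a sliding-window weight
# scheduler: keep at most `window` transformer layers resident in memory
# and prefetch the next layer's weights from disk on a background thread
# while the current layer computes, overlapping disk I/O with computation.
import threading
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


def load_layer_weights(layer_idx):
    """Placeholder for a disk read, e.g. loading one layer's weight shard."""
    return {"layer": layer_idx, "weights": b"..."}


def layer_forward(hidden, weights):
    """Placeholder for the tensor-parallel forward pass of one layer."""
    return hidden


class SlidingWindowScheduler:
    def __init__(self, num_layers, window=2):
        self.num_layers = num_layers
        self.window = window
        self.cache = OrderedDict()   # layer_idx -> weights, oldest first
        self.pending = {}            # layer_idx -> Future for in-flight loads
        self.lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=1)

    def prefetch(self, layer_idx):
        """Start loading a layer in the background if it is not already available."""
        if layer_idx >= self.num_layers:
            return
        with self.lock:
            if layer_idx in self.cache or layer_idx in self.pending:
                return
            self.pending[layer_idx] = self.pool.submit(load_layer_weights, layer_idx)

    def get(self, layer_idx):
        """Block until the layer's weights are in memory, then prefetch the next layer."""
        with self.lock:
            future = self.pending.pop(layer_idx, None)
            weights = self.cache.get(layer_idx)
        if future is not None:
            weights = future.result()
        elif weights is None:
            weights = load_layer_weights(layer_idx)   # cold miss: load synchronously
        with self.lock:
            self.cache[layer_idx] = weights
            while len(self.cache) > self.window:      # evict the oldest resident layer
                self.cache.popitem(last=False)
        self.prefetch(layer_idx + 1)                  # overlap the next load with compute
        return weights


def run_inference(hidden, num_layers=80, window=2):
    scheduler = SlidingWindowScheduler(num_layers, window)
    scheduler.prefetch(0)
    for i in range(num_layers):
        weights = scheduler.get(i)                    # layer i+1 loads while layer i runs
        hidden = layer_forward(hidden, weights)
    return hidden


if __name__ == "__main__":
    print(run_inference(hidden=[0.0] * 8, num_layers=4))
```

In a real system the placeholders would be replaced by reads of per-layer weight shards and the actual tensor-parallel layer computation across devices; the point of the sketch is only the window-bounded residency and the compute/I-O overlap that the abstract credits for the low (3.1 GiB) peak memory footprint.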
About the Journal:
IEEE Transactions on Services Computing encompasses the computing and software aspects of the science and technology of services innovation research and development. It places emphasis on algorithmic, mathematical, statistical, and computational methods central to services computing. Topics covered include Service Oriented Architecture, Web Services, Business Process Integration, Solution Performance Management, and Services Operations and Management. The transactions address mathematical foundations, security, privacy, agreement, contract, discovery, negotiation, collaboration, and quality of service for web services. It also covers areas like composite web service creation, business and scientific applications, standards, utility models, business process modeling, integration, collaboration, and more in the realm of Services Computing.