{"title":"TPI-LLM: Serving 70B-Scale LLMs Efficiently on Low-Resource Mobile Devices","authors":"Zonghang Li;Wenjiao Feng;Mohsen Guizani;Hongfang Yu","doi":"10.1109/TSC.2025.3596892","DOIUrl":"https://doi.org/10.1109/TSC.2025.3596892","url":null,"abstract":"LLM serving is shifting from cloud to edge due to privacy concerns over user interaction data. However, mobile devices struggle with very limited computing power and memory, requiring collaboration among multiple devices to run LLM apps. The mainstream solution, pipeline parallelism, is inefficient for such cases because mobile devices typically run only one inference task at a time. This article argues that tensor parallelism, despite its high communication cost, can better fit such scenarios. We introduce TPI-LLM, a compute and memory-efficient tensor parallel inference system designed to run 70B-scale LLMs on low-resource mobile devices. It keeps sensitive raw data local on users’ devices and employs a sliding window memory scheduler to dynamically manage layer weights. It overlaps disk I/O with computation and communication, enabling efficient operation of large models on memory-limited devices. Extensive experiments show that TPI-LLM reduces token latency by 80%–90% compared to Transformers, Accelerate, and Galaxy. It also cuts the peak memory footprint by 90%, requiring just 3.1 GiB of memory for 70B-scale models.","PeriodicalId":13255,"journal":{"name":"IEEE Transactions on Services Computing","volume":"18 5","pages":"3321-3333"},"PeriodicalIF":5.8,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Effective Multi-Scale Contrastive Learning System for Online Group Recommendation Services in Event-Based Social Networks","authors":"Xiaomei Huang, Zhiheng Zhou, Jianjun Li, Neal N. Xiong, Yugen Yi, Jin Liu, Guoqiong Liao","doi":"10.1109/tsc.2025.3593346","DOIUrl":"https://doi.org/10.1109/tsc.2025.3593346","url":null,"abstract":"","PeriodicalId":13255,"journal":{"name":"IEEE Transactions on Services Computing","volume":"11 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144736812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}