Mosaic: Composite projection pruning for resource-efficient LLMs

Impact Factor: 6.2 · CAS Tier 2 (Computer Science) · JCR Q1, Computer Science, Theory & Methods
Bailey J. Eccles, Leon Wong, Blesson Varghese
{"title":"Mosaic: Composite projection pruning for resource-efficient LLMs","authors":"Bailey J. Eccles ,&nbsp;Leon Wong ,&nbsp;Blesson Varghese","doi":"10.1016/j.future.2025.108056","DOIUrl":null,"url":null,"abstract":"<div><div>Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning — the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop <span><span>Mosaic</span></span>, a novel system to create and deploy pruned LLMs using composite projection pruning. <span><span>Mosaic</span></span> is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. <span><span>Mosaic</span></span> is 7.19<span><math><mo>×</mo></math></span> faster in producing models than existing approaches. <span><span>Mosaic</span></span> models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use is noted for <span><span>Mosaic</span></span> models.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"175 ","pages":"Article 108056"},"PeriodicalIF":6.2000,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X25003516","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0

Abstract

Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods that are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning: the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19× faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use are noted for Mosaic models.
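The abstract does not spell out the algorithm, so the sketch below is only an illustration of the stated idea, not the authors' implementation: the function composite_projection_prune, its keep-ratio parameters, and the magnitude-based selection criteria are all assumptions. It shows, for a single PyTorch linear projection, how a structured step (removing whole output rows, which physically shrinks the matrix) can be combineded with an unstructured step (zeroing low-magnitude weights, which preserves shape while raising sparsity).

```python
# Minimal sketch of composite projection pruning on one nn.Linear.
# Hypothetical helper; the paper's Mosaic system is not reproduced here.
import torch
import torch.nn as nn

def composite_projection_prune(proj: nn.Linear,
                               row_keep_ratio: float = 0.75,
                               weight_keep_ratio: float = 0.5) -> nn.Linear:
    """Illustrative only: prunes a single projection in isolation."""
    W = proj.weight.data  # shape: (out_features, in_features)

    # Structured step: keep the output rows with the largest L2 norm.
    # Removing rows shrinks the matrix (less memory, faster matmuls),
    # but downstream layers must be adjusted to the new width; a real
    # system would do that bookkeeping across the whole model.
    n_rows = max(1, int(row_keep_ratio * W.size(0)))
    keep = torch.topk(W.norm(dim=1), n_rows).indices.sort().values
    W_kept = W[keep].clone()
    b_kept = proj.bias.data[keep].clone() if proj.bias is not None else None

    # Unstructured step: zero the smallest-magnitude weights among the
    # surviving rows. Shape is unchanged; only sparsity increases, which
    # is why this step is gentler on accuracy than removing rows.
    n_drop = W_kept.numel() - int(weight_keep_ratio * W_kept.numel())
    if n_drop > 0:
        threshold = W_kept.abs().flatten().kthvalue(n_drop).values
        W_kept[W_kept.abs() <= threshold] = 0.0

    pruned = nn.Linear(W.size(1), n_rows, bias=b_kept is not None)
    pruned.weight.data = W_kept
    if b_kept is not None:
        pruned.bias.data = b_kept
    return pruned
```

In a full transformer, the same treatment would be applied per attention or MLP projection, with the dimensions of consuming layers reconciled afterwards; automating that reconciliation is the kind of work a deployment system such as Mosaic would handle.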
Source Journal

CiteScore: 19.90
Self-citation rate: 2.70%
Articles published per year: 376
Review time: 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.