eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models

IF 1.4 | Zone 3, Computer Science | JCR Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters Pub Date : 2024-02-07 DOI:10.1109/LCA.2024.3363492

Minsik Cho;Keivan A. Vahid;Qichen Fu;Saurabh Adya;Carlo C. Del Mundo;Mohammad Rastegari;Devang Naik;Peter Zatloukal

IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 37-40. https://ieeexplore.ieee.org/document/10423861/

Citations: 0

Abstract

Since Large Language Models (LLMs) have demonstrated high-quality performance on many complex language tasks, there is great interest in bringing these LLMs to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression and is supported by modern smartphones. Yet its training overhead is prohibitive for LLM fine-tuning. In particular, Differentiable KMeans Clustering (DKM) has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this letter, we propose eDKM, a memory-efficient DKM implementation powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude. For a given tensor to be saved on CPU for the backward pass of DKM, we first check whether a duplicate tensor has already been copied to CPU and, if not, compress the tensor by applying uniquification and sharding. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130×, while delivering good accuracy on broader LLM benchmarks (e.g., 77.7% for PIQA and 66.1% for WinoGrande).
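To make the idea of weight clustering at 3 bits/weight concrete, the following is a minimal sketch, not the eDKM or DKM algorithm. It assumes plain (non-differentiable) Lloyd-style k-means over scalar weights, a synthetic Gaussian weight matrix, and hypothetical helper names (`cluster_weights`, `dequantize`); none of these come from the letter. Each weight is replaced by a 3-bit index into a lookup table of 8 centroids.

```python
import numpy as np


def cluster_weights(weights, bits=3, iters=20):
    """Plain k-means over scalar weights (illustrative, not DKM).

    Returns (lut, indices): a lookup table of 2**bits centroids and,
    for every weight, the index of its nearest centroid.
    """
    flat = weights.astype(np.float32).ravel()
    k = 2 ** bits
    # Initialise centroids at evenly spaced quantiles of the weights.
    lut = np.quantile(flat, (np.arange(k) + 0.5) / k).astype(np.float32)
    for _ in range(iters):
        # Assignment step: nearest centroid for each weight.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                lut[c] = members.mean()
    # Final assignment against the updated centroids.
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8).reshape(weights.shape)


def dequantize(lut, indices):
    """Rebuild an approximate weight tensor from the table and indices."""
    return lut[indices]


# Synthetic stand-in for one weight matrix (an assumption, not real LLM weights).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 64)).astype(np.float32)

lut, idx = cluster_weights(w, bits=3)
w_hat = dequantize(lut, idx)

# Storage estimate: 3 bits per index plus a 16-bit table of 8 centroids.
fp16_bytes = w.size * 2
clustered_bytes = w.size * 3 / 8 + lut.size * 2
print(f"unique values after clustering: {np.unique(w_hat).size}")
print(f"compression ratio vs. 16-bit: {fp16_bytes / clustered_bytes:.2f}x")
print(f"mean squared error: {np.mean((w - w_hat) ** 2):.2e}")
```

For a large tensor the table overhead is negligible, so the ideal ratio against 16-bit storage approaches 16/3 ≈ 5.3×, which is in the same range as the 12.6 GB to 2.5 GB reduction quoted in the abstract. The sketch says nothing about train-time memory, which is the problem eDKM itself targets.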

Source journal: IEEE Computer Architecture Letters (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

CiteScore: 4.60

Self-citation rate: 4.30%

Journal description: IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.