ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers

Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov

Transactions on Machine Learning Research, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12362356/pdf/
Abstract
We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time, leveraging state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization, and it outperforms finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction-following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models as part of LLMTools, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.
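To make the central idea concrete, below is a minimal PyTorch sketch of a quantization-agnostic low-rank adaptation layer: a frozen base weight is materialized on demand from a black-box quantizer, while only small LoRA factors receive gradients. This is an illustrative sketch under simplifying assumptions (a toy uniform quantizer, no caching or offloading); the class and function names (BlackBoxQuantizer, QuantizedLoRALinear) are hypothetical and are not the LLMTools API.

```python
# Minimal sketch: frozen low-bit base weight + trainable LoRA adapter.
# Names and the toy quantizer are illustrative assumptions, not LLMTools code.
import torch
import torch.nn as nn


class BlackBoxQuantizer:
    """Stand-in for any user-specified quantizer (e.g. a 2/3/4-bit scheme)."""

    def __init__(self, weight: torch.Tensor, bits: int = 3):
        # A real quantizer would store packed low-bit codes; here we use
        # simple uniform rounding so the sketch stays self-contained.
        self.scale = weight.abs().max() / (2 ** (bits - 1) - 1)
        self.codes = torch.clamp(
            torch.round(weight / self.scale),
            -(2 ** (bits - 1)), 2 ** (bits - 1) - 1,
        ).to(torch.int8)

    def dequantize(self) -> torch.Tensor:
        # Materialize an approximate full-precision weight matrix on demand.
        return self.codes.float() * self.scale


class QuantizedLoRALinear(nn.Module):
    """Frozen quantized base weight plus a trainable low-rank update."""

    def __init__(self, quantizer: BlackBoxQuantizer, in_features: int,
                 out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.quantizer = quantizer  # kept low-bit; not an nn.Parameter
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The base weight is dequantized inside forward; backward only needs
        # gradients w.r.t. the input and the LoRA factors, so the base weight
        # never has to be stored persistently in full precision.
        w = self.quantizer.dequantize()
        base = x @ w.t()
        lora = (x @ self.lora_A.t()) @ self.lora_B.t() * self.scaling
        return base + lora


# Usage: wrap a pretrained weight and train only the low-rank adapter.
pretrained = torch.randn(256, 128)
layer = QuantizedLoRALinear(BlackBoxQuantizer(pretrained, bits=3), 128, 256)
out = layer(torch.randn(4, 128))
out.sum().backward()  # gradients flow to lora_A and lora_B only
```

Because the layer only calls a generic dequantize() routine, the same adapter code works with any quantization module plugged in behind that interface, which is the modularity the abstract describes.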