An empirical study of LLaMA3 quantization: from LLMs to MLLMs.

Visual Intelligence · Pub Date: 2024-01-01 · Epub Date: 2024-12-30 · DOI: 10.1007/s44267-024-00070-x
Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno
{"title":"An empirical study of LLaMA3 quantization: from LLMs to MLLMs.","authors":"Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno","doi":"10.1007/s44267-024-00070-x","DOIUrl":null,"url":null,"abstract":"<p><p>The LLaMA family, a collection of foundation language models ranging from 7B to 65B parameters, has become one of the most powerful open-source large language models (LLMs) and the popular LLM backbone of multi-modal large language models (MLLMs), widely used in computer vision and natural language understanding tasks. In particular, LLaMA3 models have recently been released and have achieved impressive performance in various domains with super-large scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-constrained scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration can potentially provide new insights and challenges for the low-bit quantization of LLaMA3 and other future LLMs, especially in addressing performance degradation issues that suffer in LLM compression. Specifically, we comprehensively evaluate the 10 existing post-training quantization and LoRA fine-tuning (LoRA-FT) methods of LLaMA3 on 1-8 bits and various datasets to reveal the low-bit quantization performance of LLaMA3. To uncover the capabilities of low-bit quantized MLLM, we assessed the performance of the LLaMA3-based LLaVA-Next-8B model under 2-4 ultra-low bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers from non-negligible degradation in linguistic and visual contexts, particularly under ultra-low bit widths. This highlights the significant performance gap at low bit-width that needs to be addressed in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit to enhance practicality.</p>","PeriodicalId":520376,"journal":{"name":"Visual intelligence","volume":"2 1","pages":"36"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11728678/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Visual intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s44267-024-00070-x","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/30 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

The LLaMA family, a collection of foundation language models ranging from 7B to 65B parameters, has become one of the most powerful open-source large language models (LLMs) and a popular LLM backbone for multi-modal large language models (MLLMs), widely used in computer vision and natural language understanding tasks. In particular, the recently released LLaMA3 models have achieved impressive performance in various domains through super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-constrained scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration can provide new insights into, and expose new challenges for, the low-bit quantization of LLaMA3 and other future LLMs, especially in addressing the performance degradation that arises in LLM compression. Specifically, we comprehensively evaluate 10 existing post-training quantization and LoRA fine-tuning (LoRA-FT) methods on LLaMA3 at 1-8 bits and on various datasets to reveal its low-bit quantization performance. To uncover the capabilities of low-bit quantized MLLMs, we also assess the LLaMA3-based LLaVA-Next-8B model under ultra-low bit-widths (2-4 bits) with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in both linguistic and visual contexts, particularly at ultra-low bit-widths. This highlights a significant performance gap at low bit-widths that needs to be addressed in future developments. We expect this empirical study to prove valuable in advancing future models, driving LLMs and MLLMs to higher accuracy at lower bit-widths and thus greater practicality.
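To make the setting concrete, the sketch below shows the simplest kind of post-training weight quantization, round-to-nearest (RTN) uniform quantization with per-group asymmetric scales, representative of the general family this study benchmarks. It is a minimal illustration, not the paper's implementation: the function name quantize_rtn, the group size of 128, and the 4-bit setting are all assumptions chosen for the example.

```python
# Minimal sketch of round-to-nearest (RTN) post-training weight
# quantization with per-group asymmetric scales. Names and settings
# (quantize_rtn, group_size=128, n_bits=4) are illustrative only.
import torch

def quantize_rtn(weight: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize a 2-D weight matrix to n_bits and return its dequantized form."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(-1, group_size)                 # contiguous groups within each row
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    q_max = 2 ** n_bits - 1                            # e.g. 15 for 4-bit
    scale = (w_max - w_min).clamp(min=1e-8) / q_max    # per-group step size
    zero_point = torch.round(-w_min / scale)           # maps w_min to integer 0
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, q_max)
    w_dequant = (q - zero_point) * scale               # values a quantized layer would use
    return w_dequant.reshape(out_features, in_features)

# Example: quantize one linear layer's weight to 4 bits and inspect the error
layer = torch.nn.Linear(4096, 4096, bias=False)
w_q = quantize_rtn(layer.weight.data, n_bits=4)
print("mean abs error:", (layer.weight.data - w_q).abs().mean().item())
```

Dropping n_bits from 4 to 2 in this sketch makes the rounding error grow sharply, which mirrors the ultra-low-bit degradation the study reports; the stronger PTQ and LoRA-FT methods it evaluates exist precisely to close that gap.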
