QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa
{"title":"QuIP#:用Hadamard非相干和Lattice码本实现更好的LLM量化。","authors":"Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes (≤ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric <math> <msub><mrow><mi>E</mi></mrow> <mrow><mn>8</mn></mrow> </msub> </math> lattice, which achieves the optimal 8-dimension unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"235 ","pages":"48630-48656"},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395268/pdf/","citationCount":"0","resultStr":"{\"title\":\"QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks.\",\"authors\":\"Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes (≤ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric <math> <msub><mrow><mi>E</mi></mrow> <mrow><mn>8</mn></mrow> </msub> </math> lattice, which achieves the optimal 8-dimension unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. 
Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.</p>\",\"PeriodicalId\":74504,\"journal\":{\"name\":\"Proceedings of machine learning research\",\"volume\":\"235 \",\"pages\":\"48630-48656\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395268/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of machine learning research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes (≤ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric E8 lattice, which achieves the optimal 8-dimensional unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
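To make the first two ingredients concrete, below is a minimal NumPy sketch of (a) incoherence processing with a two-sided randomized Hadamard transform and (b) rounding an 8-dimensional block to the nearest point of the E8 lattice. This is an illustration under simplifying assumptions, not the QuIP# implementation: it builds dense Hadamard matrices instead of a fast transform, it rounds to the full (infinite) E8 lattice rather than the finite hardware-efficient codebooks described in the abstract, and the names `hadamard`, `rht_incoherence`, `nearest_d8`, and `nearest_e8` are invented for this sketch rather than taken from the quip-sharp repository.

```python
# Illustrative sketch only: dense RHT incoherence processing + E8 lattice rounding.
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H


def rht_incoherence(W: np.ndarray, seed: int = 0):
    """Two-sided randomized Hadamard transform: W_hat = U @ W @ V.T with
    U = H_m diag(s_out)/sqrt(m) and V = H_n diag(s_in)/sqrt(n) both orthogonal,
    so the transform is exactly invertible and spreads out large weight entries."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    s_out = rng.choice([-1.0, 1.0], size=m)   # random sign flips, output side
    s_in = rng.choice([-1.0, 1.0], size=n)    # random sign flips, input side
    U = hadamard(m) * s_out / np.sqrt(m)
    V = hadamard(n) * s_in / np.sqrt(n)
    return U @ W @ V.T, U, V


def nearest_d8(x: np.ndarray) -> np.ndarray:
    """Nearest point of D8 = {z in Z^8 : sum(z) even} (Conway-Sloane style rounding)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding error.
        i = int(np.argmax(np.abs(x - f)))
        step = np.sign(x[i] - f[i])
        f[i] += step if step != 0 else 1.0
    return f


def nearest_e8(x: np.ndarray) -> np.ndarray:
    """Nearest point of E8 = D8 union (D8 + 1/2): try both cosets, keep the closer."""
    c0 = nearest_d8(x)
    c1 = nearest_d8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 256))
    W[0, 0] = 50.0                            # plant an outlier weight
    W_hat, U, V = rht_incoherence(W)
    assert np.allclose(U.T @ W_hat @ V, W)    # orthogonal on both sides => invertible
    print(f"max|W| = {np.abs(W).max():.1f}, max|W_hat| = {np.abs(W_hat).max():.1f}")

    # Round one 8-dimensional block of the incoherent weights to the E8 lattice.
    v = rng.standard_normal(8)
    print("nearest E8 point to", np.round(v, 2), "is", nearest_e8(v))
```

Because U and V are orthogonal, the transform can be undone exactly at inference time, and the planted outlier is spread across many small entries before quantization; the dense matrices here merely stand in for the fast Hadamard kernels and finite E8-based codebooks that the abstract credits for QuIP#'s speed and hardware efficiency.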