CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

arXiv - CS - Machine Learning. Pub Date: 2024-09-16. DOI: arxiv-2409.10593

Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang

Citations: 0

Abstract

Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning; both have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address these two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. We introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for the key and value layers and storing the low-dimensional features. (2) To preserve model performance, we introduce a bi-branch KV cache, consisting of a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLM. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
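The three ideas in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it uses plain truncated SVD as a stand-in for the low-rank decomposition (no training step), a synthetic weight matrix with decaying singular values, and hypothetical names (`low_rank_factors`, `bi_branch_keys`, `kv_memory_fraction`). The rank ratio of 0.2 and the 4-bit quantization width are assumptions chosen only because they reproduce the 80% and 95% figures arithmetically; the abstract does not state them, and the memory estimate ignores the full-precision window branch.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Split weight (d_model x d_kv) into A (d_model x rank) and B (rank x d_kv)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # absorb the singular values into the left factor
    b = vt[:rank, :]
    return a, b


def reconstruction_loss(x, weight, a, b):
    """Layer-wise MSE between the original keys and the low-rank reconstruction."""
    return float(np.mean((x @ weight - (x @ a) @ b) ** 2))


def bi_branch_keys(x, weight, a, b, window):
    """Recent `window` tokens use full keys; older tokens are rebuilt from x @ A."""
    compressed = x @ a            # (seq_len, rank): what the compressed branch stores
    keys = compressed @ b         # reconstructed keys for every token
    keys[-window:] = x[-window:] @ weight  # overwrite the recent window exactly
    return keys, compressed


def kv_memory_fraction(rank, d_kv, bits=16, base_bits=16):
    """Compressed cache size relative to the original (window branch ignored)."""
    return (rank / d_kv) * (bits / base_bits)


rng = np.random.default_rng(0)
d_model = d_kv = 80
rank, seq_len, window = 16, 128, 8

# Synthetic projection weight with exponentially decaying singular values
# (an assumption for illustration; real layers must be measured).
q1, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
q2, _ = np.linalg.qr(rng.standard_normal((d_kv, d_kv)))
singular_values = np.exp(-0.3 * np.arange(d_kv))
weight = q1 @ np.diag(singular_values) @ q2.T

x = rng.standard_normal((seq_len, d_model))  # stand-in hidden states
full_keys = x @ weight

a, b = low_rank_factors(weight, rank)
loss_low_rank = reconstruction_loss(x, weight, a, b)

keys, compressed = bi_branch_keys(x, weight, a, b, window)
loss_bi_branch = float(np.mean((full_keys - keys) ** 2))

print("singular values kept / dropped:", singular_values[:rank].sum(),
      singular_values[rank:].sum())
print("reconstruction loss (low-rank only):", loss_low_rank)
print("reconstruction loss (bi-branch):   ", loss_bi_branch)
print("cache fraction, rank only:   ", kv_memory_fraction(rank, d_kv))
print("cache fraction, rank + 4-bit:", kv_memory_fraction(rank, d_kv, bits=4))
```

Storing `x @ a` instead of `x @ weight` shrinks the channel dimension from `d_kv` to `rank`, so the cache size scales with `rank / d_kv`. In the method described above, the factors would additionally be tuned by minimizing the layer-wise reconstruction loss rather than left at their SVD values.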
