{"title":"Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models","authors":"Bishwash Khanal, Jeffery M. Capone","doi":"arxiv-2409.11233","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) offer powerful capabilities but incur\nsubstantial computational costs, driving the need for efficient compression\ntechniques. This study evaluates the impact of popular compression methods -\nMagnitude Pruning, SparseGPT, and Wanda - on the LLaMA-2-7B model, focusing on\nthe trade-offs between model size reduction, downstream task performance, and\nthe role of calibration data. Our findings reveal that while SparseGPT and\nWanda preserve perplexity even at 50% sparsity, they suffer significant\ndegradation on downstream tasks, highlighting the inadequacy of perplexity as\nthe sole evaluation metric. To address this, we introduce Jensen-Shannon (JS)\nDivergence as a more comprehensive metric that captures nuanced changes in\nmodel behavior post-compression. We further demonstrate that task-specific\ncalibration data significantly enhances the downstream performance of\ncompressed models compared to general calibration data. This research\nunderscores the necessity for diverse evaluation metrics and careful\ncalibration data selection to fully understand the complexities of LLM\ncompression and its implications for practical applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11233","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of popular compression methods - Magnitude Pruning, SparseGPT, and Wanda - on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.
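
The abstract's central methodological point is measuring post-compression behavioral change with JS Divergence rather than perplexity alone. The sketch below is not the authors' implementation; it only illustrates how such a comparison could be computed between an original and a pruned checkpoint. The Hugging Face model identifiers, the compressed-model path, the single example prompt, and the per-position averaging are all illustrative assumptions.

```python
# Minimal sketch: JS divergence between next-token distributions of an
# original and a compressed (e.g., 50%-sparsity pruned) causal LM.
# Model IDs, the compressed-model path, and the prompt are hypothetical.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = 0.5 * (P + Q).

    p, q: probability tensors with the vocabulary on the last dimension.
    Returns one divergence value per sequence position.
    """
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)


tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
original = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
compressed = AutoModelForCausalLM.from_pretrained("path/to/pruned-llama-2-7b")  # hypothetical checkpoint

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    # Shape: (batch, seq_len, vocab_size) after softmax over the vocabulary.
    p = F.softmax(original(**inputs).logits, dim=-1)
    q = F.softmax(compressed(**inputs).logits, dim=-1)

# Average the per-position divergences into a single score for this prompt;
# in practice the score would be aggregated over an evaluation set.
score = js_divergence(p, q).mean()
print(f"Mean JS divergence: {score.item():.4f}")
```

A compressed model can match the original's perplexity yet still shift probability mass across the vocabulary in ways that hurt downstream tasks; a distribution-level score of this kind is intended to surface such shifts.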