{"title":"Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models","authors":"Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou","doi":"arxiv-2409.10197","DOIUrl":null,"url":null,"abstract":"Recent progress in Multimodal Large Language Models(MLLMs) often use large\nimage tokens to compensate the visual shortcoming of MLLMs, which not only\nexhibits obvious redundancy but also greatly exacerbates the already high\ncomputation. Token pruning is an effective solution for speeding up MLLMs, but\nwhen and how to drop tokens still remains a challenge. In this paper, we\npropose a novel and training-free approach for the effective visual token\npruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning\nrecipe for MLLMs according to a pre-defined budget. Specifically, FitPrune\nconsiders token pruning as a statistical problem of MLLM and its objective is\nto find out an optimal pruning scheme that can minimize the divergence of the\nattention distributions before and after pruning. In practice, FitPrune can be\nquickly accomplished based on the attention statistics from a small batch of\ninference data, avoiding the expensive trials of MLLMs. According to the\npruning recipe, an MLLM can directly remove the redundant visual tokens of\ndifferent examples during inference. To validate FitPrune, we apply it to a set\nof recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct\nextensive experiments on a set of benchmarks. The experimental results show\nthat our FitPrune can not only reduce the computational complexity to a large\nextent, while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT\nwith only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in\nabout 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10197","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Recent progress in Multimodal Large Language Models (MLLMs) often relies on a large number of image tokens to compensate for the visual shortcomings of MLLMs, which not only introduces obvious redundancy but also greatly increases the already high computational cost. Token pruning is an effective way to speed up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel, training-free approach to effective visual token pruning for MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for an MLLM according to a pre-defined budget. Specifically, FitPrune treats token pruning as a statistical problem of the MLLM, and its objective is to find an optimal pruning scheme that minimizes the divergence of the attention distributions before and after pruning. In practice, the recipe can be fitted quickly from the attention statistics of a small batch of inference data, avoiding expensive trial runs of the MLLM. Following the pruning recipe, the MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The results show that FitPrune greatly reduces computational complexity while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
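To make the idea concrete, below is a minimal PyTorch sketch of attention-statistics-driven visual token pruning under a token budget. The function names (collect_attention_stats, build_pruning_recipe) and the per-layer keep-top-k rule are illustrative assumptions; they stand in for, rather than reproduce, FitPrune's actual divergence-minimization objective and recipe search.

```python
# Hedged illustration only, not the authors' implementation.
import torch

def collect_attention_stats(attn_maps, visual_idx):
    """Average attention received by each visual token over a small batch.

    attn_maps: list of [batch, heads, seq, seq] attention tensors, one per layer.
    visual_idx: tensor of indices of the visual tokens within the sequence.
    Returns a [num_layers, num_visual_tokens] importance matrix.
    """
    stats = []
    for layer_attn in attn_maps:
        # Attention mass flowing into each visual token, averaged over
        # batch, heads, and query positions.
        received = layer_attn[..., visual_idx].mean(dim=(0, 1, 2))
        stats.append(received)
    return torch.stack(stats)

def build_pruning_recipe(stats, keep_ratio):
    """For each layer, keep the top-k visual tokens by received attention so
    that most of the original attention mass is preserved (a crude stand-in
    for minimizing the pre-/post-pruning attention divergence)."""
    recipe = []
    for layer_stats in stats:
        k = max(1, int(keep_ratio * layer_stats.numel()))
        kept = torch.argsort(layer_stats, descending=True)[:k]
        recipe.append(kept)  # indices of visual tokens kept at this layer
    return recipe
```

In this sketch the statistics are gathered once from a small batch of inference examples, and the resulting recipe (which visual tokens to keep at which layer for a given budget) is then applied directly at inference time, with no retraining of the MLLM.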