Neighbor patches merging reduces spatial redundancy to accelerate vision transformer
Kai Jiang, Peng Peng, Youzao Lian, Weihui Shao, Weisheng Xu
Neurocomputing, published 2024-10-19. DOI: 10.1016/j.neucom.2024.128733
https://www.sciencedirect.com/science/article/pii/S0925231224015042
Abstract
Vision Transformers (ViTs) deliver outstanding performance but often require substantial computational resources. Various token pruning methods have been developed to enhance throughput by removing redundant tokens; however, these methods do not address peak memory consumption, which remains equivalent to that of the unpruned networks. In this study, we introduce Neighbor Patches Merging (NEPAM), a method that significantly reduces the maximum memory footprint of ViTs while pruning tokens. NEPAM targets spatial redundancy within images and prunes redundant patches at the onset of the model, thereby achieving the optimal throughput-accuracy trade-off without fine-tuning. Experimental results demonstrate that NEPAM can accelerate the inference speed of the ViT-Base-Patch16-384 model by 25% with a negligible accuracy loss of 0.07% and a notable 18% reduction in memory usage. When applied to VideoMAE, NEPAM doubles the throughput with a 0.29% accuracy loss and a 48% reduction in memory usage. These findings underscore the efficacy of NEPAM in mitigating computational requirements while maintaining model performance.
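The abstract does not specify NEPAM's exact merging rule, so the PyTorch sketch below is only a hypothetical illustration of the general idea it describes: merging spatially adjacent, highly similar patch tokens before the transformer encoder, so that every subsequent layer, and therefore the peak activation memory, operates on a shorter token sequence. The function name, the pairing of each patch with its right-hand neighbor, the cosine-similarity criterion, and the 0.9 threshold are all assumptions for illustration, not the published NEPAM algorithm.

import torch
import torch.nn.functional as F


def merge_neighbor_patches(tokens: torch.Tensor, grid_h: int, grid_w: int,
                           threshold: float = 0.9) -> torch.Tensor:
    """Illustrative (hypothetical) neighbor-patch merging for one image.

    tokens: (N, C) patch embeddings in row-major order, N == grid_h * grid_w.
    Each patch is paired with its right-hand neighbor; pairs whose cosine
    similarity exceeds `threshold` are averaged into a single token, while
    dissimilar pairs keep both tokens. Returns a (M, C) tensor with M <= N.
    """
    assert tokens.shape[0] == grid_h * grid_w and grid_w % 2 == 0
    # Group horizontally adjacent patches into pairs: (grid_h, grid_w//2, 2, C)
    x = tokens.reshape(grid_h, grid_w // 2, 2, -1)
    left, right = x[:, :, 0, :], x[:, :, 1, :]
    sim = F.cosine_similarity(left, right, dim=-1)   # (grid_h, grid_w//2)

    out = []
    for i in range(grid_h):
        for j in range(grid_w // 2):
            if sim[i, j] > threshold:                # redundant neighbors: merge
                out.append(0.5 * (left[i, j] + right[i, j]))
            else:                                    # distinct content: keep both
                out.append(left[i, j])
                out.append(right[i, j])
    return torch.stack(out)


# Example: the 24x24 patch grid of a 384-resolution ViT-Base/16 (768-dim tokens)
patches = torch.randn(24 * 24, 768)
reduced = merge_neighbor_patches(patches, 24, 24, threshold=0.9)
print(patches.shape, "->", reduced.shape)

Because the merging happens at the input stage rather than inside later blocks, no layer ever materializes the full-length token sequence, which is the mechanism the abstract credits for reducing peak memory rather than only throughput.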
Journal overview
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. The journal's essential topics are neurocomputing theory, practice, and applications.