Efficient feature selection for pre-trained vision transformers

IF 4.3 · CAS Zone 3 (Computer Science) · JCR Q2, Computer Science, Artificial Intelligence

Computer Vision and Image Understanding · Pub Date: 2025-03-01 · DOI: 10.1016/j.cviu.2025.104326

Lan Huang, Jia Zeng, Mengqiang Yu, Weiping Ding, Xingyu Bai, Kangping Wang

Volume 254, Article 104326. Full text: https://www.sciencedirect.com/science/article/pii/S1077314225000499

Citations: 0

Abstract

Handcrafted layer-wise vision transformers have demonstrated remarkable performance in image classification. However, their high computational cost limits their practical applications. In this paper, we first identify and highlight the data-independent feature redundancy in pre-trained Vision Transformer (ViT) models. Based on this observation, we explore the feasibility of searching for the best substructure within the original pre-trained model. To this end, we propose EffiSelecViT, a novel pruning method aimed at reducing the computational cost of ViTs while preserving their accuracy. EffiSelecViT introduces importance scores for both self-attention heads and Multi-Layer Perceptron (MLP) neurons in pre-trained ViT models. L1 regularization is applied to constrain and learn these scores. In this simple way, components that are crucial for model performance are assigned higher scores, while those with lower scores are identified as less important and subsequently pruned. Experimental results demonstrate that EffiSelecViT can prune DeiT-B to retain only 64% of FLOPs while maintaining accuracy. This efficiency-accuracy trade-off is consistent across various ViT architectures. Furthermore, qualitative analysis reveals enhanced information expression in the pruned models, affirming the effectiveness and practicality of EffiSelecViT. The code is available at https://github.com/ZJ6789/EffiSelecViT.
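The abstract says only that importance scores are attached to attention heads and MLP neurons and learned under L1 regularization; it gives neither the objective nor the optimizer. The following is a minimal toy sketch of that general idea, not the EffiSelecViT implementation. Everything in it is an assumption made for illustration: the four scalar "components", the frozen weights, the squared-error loss, the proximal-gradient (soft-thresholding) update, and the names `learn_scores` and `keep_mask`.

```python
import random


def soft_threshold(value, shrink):
    """Proximal step for an L1 penalty: move value toward zero by shrink."""
    if value > shrink:
        return value - shrink
    if value < -shrink:
        return value + shrink
    return 0.0


def learn_scores(features, targets, weights, lam=0.05, lr=0.1, steps=500):
    """Learn one importance score per component; the weights stay frozen."""
    n, m = len(weights), len(features)
    scores = [1.0] * n  # all scores at 1.0 reproduces the unpruned model
    for _ in range(steps):
        grads = [0.0] * n
        for x, y in zip(features, targets):
            pred = sum(s * w * xi for s, w, xi in zip(scores, weights, x))
            err = pred - y
            for i in range(n):
                # Gradient of the mean squared error w.r.t. score i.
                grads[i] += 2.0 * err * weights[i] * x[i] / m
        # Gradient step on the task loss, then L1 shrinkage of each score.
        scores = [soft_threshold(s - lr * g, lr * lam)
                  for s, g in zip(scores, grads)]
    return scores


def keep_mask(scores, threshold=1e-3):
    """Mark components whose learned score is above the pruning threshold."""
    return [abs(s) > threshold for s in scores]


# Hypothetical toy data: four components feed the output, but the target
# depends only on the first two, so the last two are redundant.
rng = random.Random(0)
weights = [1.0, 0.5, 0.3, 0.2]
features = [[rng.uniform(-1.0, 1.0) for _ in weights] for _ in range(200)]
targets = [weights[0] * x[0] + weights[1] * x[1] for x in features]

scores = learn_scores(features, targets, weights)
keep = keep_mask(scores)
print("scores:", [round(s, 3) for s in scores])
print("keep:  ", keep)
```

In this toy setting the L1 term drives the scores of the two redundant components to zero while the useful ones stay positive, which mirrors the "low score, then prune" step the abstract describes. A real ViT would place such scores on head outputs and MLP hidden units and train them with a standard deep-learning optimizer.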


Source journal

Computer Vision and Image Understanding (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems