MVFormer: Diversifying feature normalization and token mixing for efficient vision transformers
Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim
Pattern Recognition Letters, Volume 197 (2025), Pages 72-80. DOI: 10.1016/j.patrec.2025.07.019
Abstract
Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on token mixers, overlooking their potential relationship with normalization. To boost diverse feature learning, we propose two components: multi-view normalization (MVN) and a multi-view token mixer (MVTM). MVN integrates three differently normalized features, obtained via batch, layer, and instance normalization, through a learnable weighted sum; this is expected to present diverse feature distributions to the token mixer, yielding a beneficial synergy. MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters; it incorporates stage specificity by configuring different receptive fields at each stage, efficiently capturing a wide range of visual patterns. Adopting both components in the MetaFormer block, we propose a novel ViT, the multi-vision transformer (MVFormer). MVFormer outperforms state-of-the-art convolution-based ViTs on image classification with the same or fewer parameters and MACs. In particular, the MVFormer variants MVFormer-T, -S, and -B achieve 83.4%, 84.3%, and 84.6% top-1 accuracy, respectively, on the ImageNet-1K benchmark.
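The abstract describes MVN as a learnable weighted sum of batch-, layer-, and instance-normalized features, and MVTM as a multiscale depthwise-convolution mixer. Below is a minimal PyTorch sketch of how such components might look; the class names, kernel sizes, the softmax over the mixing weights, and the use of GroupNorm as a stand-in for layer normalization on feature maps are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiViewNormalization(nn.Module):
    """Sketch of MVN: learnable weighted sum of batch-, layer-, and
    instance-normalized views of a (B, C, H, W) feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        # GroupNorm with one group normalizes over (C, H, W) per sample,
        # a common layer-norm stand-in for conv features (assumption).
        self.ln = nn.GroupNorm(1, channels)
        self.inorm = nn.InstanceNorm2d(channels, affine=True)
        # One learnable mixing weight per view; the softmax below keeps the
        # combination convex (an assumption, not stated in the abstract).
        self.weights = nn.Parameter(torch.ones(3) / 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)
        return w[0] * self.bn(x) + w[1] * self.ln(x) + w[2] * self.inorm(x)


class MultiViewTokenMixer(nn.Module):
    """Sketch of MVTM: parallel depthwise convolutions acting as local,
    intermediate, and global filters. The kernel sizes are illustrative;
    the paper varies the receptive fields per stage."""

    def __init__(self, channels: int, kernels=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # How the three views are fused is an assumption; a plain sum is
        # used here for simplicity.
        return sum(branch(x) for branch in self.branches)


# Usage: normalization feeding the token mixer, as in a MetaFormer-style block.
x = torch.randn(2, 64, 56, 56)
block = nn.Sequential(MultiViewNormalization(64), MultiViewTokenMixer(64))
print(block(x).shape)  # torch.Size([2, 64, 56, 56])
```

The sketch preserves the abstract's key idea: the normalization stage hands the token mixer a learnable blend of differently distributed features, and the mixer captures several receptive-field scales at once with cheap depthwise convolutions.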
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association for Pattern Recognition, as well as other developing themes involving learning and recognition.