{"title":"Res2former: A multi-scale fusion based transformer feature extraction method","authors":"Bojun Xie, Yanjie Wang, Shaocong Guo, Junfen Chen","doi":"10.1016/j.jvcir.2025.104546","DOIUrl":null,"url":null,"abstract":"<div><div>In this paper, we propose Res2former, a novel lightweight hybrid architecture that combines convolutional neural networks (CNNs) and Transformers to effectively model both local and global dependencies in visual data. While Vision Transformer (ViT) demonstrates strong global modeling capability, it lack locality and translation-invariance, making it reliant on large-scale datasets and computational resources. To address this, Res2former adopts a stage-wise hybrid design: in shallow layers, CNNs replace Transformer blocks to exploit local inductive biases and reduce early computational cost; in deeper layers, we introduce a multi-scale fusion mechanism by embedding multiple parallel convolutional kernels of varying receptive fields into the Transformer’s MLP structure. This enables Res2former to capture multi-scale visual semantics more effectively and fuse features across different scales. Experimental results reveal that with the same parameters and computational complexity, Res2former outperforms variants of Transformer and CNN models in image classification (80.7 top-1 accuracy on ImageNet-1K), object detection (45.8 <span><math><mrow><mi>A</mi><msup><mrow><mi>P</mi></mrow><mrow><mi>b</mi><mi>o</mi><mi>x</mi></mrow></msup></mrow></math></span> on the COCO 2017 Validation Set), and instance segmentation (41.0 <span><math><mrow><mi>A</mi><msup><mrow><mi>P</mi></mrow><mrow><mi>m</mi><mi>a</mi><mi>s</mi><mi>k</mi></mrow></msup></mrow></math></span> on the COCO 2017 Validation Set) tasks. The code is publicly accessible at <span><span>https://github.com/hand-Max/Res2former</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104546"},"PeriodicalIF":3.1000,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Visual Communication and Image Representation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1047320325001609","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
In this paper, we propose Res2former, a novel lightweight hybrid architecture that combines convolutional neural networks (CNNs) and Transformers to effectively model both local and global dependencies in visual data. While the Vision Transformer (ViT) demonstrates strong global modeling capability, it lacks locality and translation invariance, making it reliant on large-scale datasets and computational resources. To address this, Res2former adopts a stage-wise hybrid design: in shallow layers, CNNs replace Transformer blocks to exploit local inductive biases and reduce early computational cost; in deeper layers, we introduce a multi-scale fusion mechanism by embedding multiple parallel convolutional kernels of varying receptive fields into the Transformer's MLP structure. This enables Res2former to capture multi-scale visual semantics more effectively and fuse features across different scales. Experimental results show that, at comparable parameter counts and computational complexity, Res2former outperforms Transformer and CNN variants in image classification (80.7% top-1 accuracy on ImageNet-1K), object detection (45.8 AP^box on the COCO 2017 validation set), and instance segmentation (41.0 AP^mask on the COCO 2017 validation set). The code is publicly accessible at https://github.com/hand-Max/Res2former.
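The abstract does not spell out how the parallel kernels are wired into the MLP, but a minimal sketch of the idea might look like the following. Everything here (the class name MultiScaleFusionMLP, the kernel sizes, the depthwise-convolution branches, and fusion by summation) is an illustrative assumption, not the authors' implementation; the actual code is in the linked repository.

```python
# Hypothetical sketch of a multi-scale fusion MLP block, based only on the
# abstract: parallel convolutional kernels with different receptive fields
# embedded in a Transformer's MLP. Not the authors' code.
import torch
import torch.nn as nn

class MultiScaleFusionMLP(nn.Module):
    """Transformer MLP with parallel depthwise convs at several scales (assumed)."""

    def __init__(self, dim, hidden_dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.fc1 = nn.Conv2d(dim, hidden_dim, kernel_size=1)  # token-wise expansion
        # One depthwise conv per receptive-field scale; the branch count is a guess.
        self.branches = nn.ModuleList(
            nn.Conv2d(hidden_dim, hidden_dim, k, padding=k // 2, groups=hidden_dim)
            for k in kernel_sizes
        )
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden_dim, dim, kernel_size=1)  # projection back to dim

    def forward(self, x):  # x: (B, C, H, W) feature map
        h = self.act(self.fc1(x))
        # Fuse multi-scale features by summation (the fusion rule is an assumption).
        h = sum(branch(h) for branch in self.branches)
        return self.fc2(self.act(h))

if __name__ == "__main__":
    block = MultiScaleFusionMLP(dim=64, hidden_dim=256)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

Summation keeps the branch outputs cheap to combine; concatenation followed by a 1x1 projection would be an equally plausible reading of "fuse features across different scales."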
Journal Introduction:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.