Stream-ViT: Learning Streamlined Convolutions in Vision Transformer
Yingwei Pan; Yehao Li; Ting Yao; Chong-Wah Ngo; Tao Mei
IEEE Transactions on Multimedia, vol. 27, pp. 3755-3765, published 2025-01-27. DOI: 10.1109/TMM.2025.3535321 (https://ieeexplore.ieee.org/document/10855496/)
Citations: 0
Abstract
Recently, Vision Transformers (ViT) and Convolutional Neural Networks (CNN) have begun to merge into hybrid deep architectures that offer a better trade-off among model capacity, generalization, and latency. Most of these hybrid architectures either directly stack a self-attention module with a static convolution or fuse their outputs through two pathways within each block. Instead, we present a new Transformer architecture (namely Stream-ViT) that integrates ViT with streamlined convolutions, i.e., a series of high-to-low resolution convolutions. The kernels of each convolution are dynamically learnt on the basis of the current input features plus pre-learnt kernels shared throughout the whole network. The new architecture incorporates a critical pathway for streamlined kernel generation that triggers interactions between the dynamically learnt convolutions across different layers. Moreover, the layer-wise streamlined convolution is functionally equivalent to a squeezed version of a multi-branch convolution structure, thereby improving the capacity of the self-attention module with enlarged cardinality in a cost-efficient manner. We validate the superiority of Stream-ViT on multiple vision tasks, where it surpasses state-of-the-art ViT and CNN backbones with comparable FLOPs.
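To make the dynamic-kernel idea concrete, below is a minimal PyTorch sketch of what a dynamically generated convolution of this kind could look like. The module name DynamicStreamConv, the global-average-pooling context, and the additive blend of a predicted kernel with a pre-learnt base kernel are illustrative assumptions drawn from the abstract's wording, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicStreamConv(nn.Module):
    """Hypothetical sketch: a depthwise convolution whose kernels are
    predicted from the current input features and blended with a
    pre-learnt (static) kernel, loosely following the abstract."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Pre-learnt kernel shared across all inputs (depthwise for simplicity).
        self.base_kernel = nn.Parameter(
            torch.randn(channels, 1, kernel_size, kernel_size) * 0.02
        )
        # Small head that predicts a per-input kernel from pooled features.
        self.kernel_head = nn.Linear(channels, channels * kernel_size * kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Summarize the current input features via global average pooling.
        ctx = x.mean(dim=(2, 3))                              # (B, C)
        dyn = self.kernel_head(ctx)                           # (B, C*k*k)
        dyn = dyn.view(b * c, 1, self.kernel_size, self.kernel_size)
        # Blend the input-conditioned kernel with the pre-learnt base kernel.
        kernel = dyn + self.base_kernel.repeat(b, 1, 1, 1)    # (B*C, 1, k, k)
        # Grouped-convolution trick: fold the batch into the channel axis so
        # each sample is filtered by its own predicted kernels.
        out = F.conv2d(
            x.reshape(1, b * c, h, w), kernel,
            padding=self.kernel_size // 2, groups=b * c,
        )
        return out.reshape(b, c, h, w)
```

As a usage check, `DynamicStreamConv(channels=64)(torch.randn(2, 64, 32, 32))` returns a tensor of the same shape; a high-to-low resolution "stream" of such layers could then be formed by interleaving them with downsampling, though how the paper wires that pathway is not specified in the abstract.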
Journal Introduction
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.