Dengke Zhang;Quan Tang;Fagui Liu;Haiqing Mei;C. L. Philip Chen
Title: Exploring Token-Level Augmentation in Vision Transformer for Semi-Supervised Semantic Segmentation
DOI: 10.1109/LSP.2025.3562821
Journal: IEEE Signal Processing Letters, vol. 32, pp. 1885-1889
Publication date: 2025-04-21 (Journal Article)
Impact factor: 3.2; JCR: Q2 (Engineering, Electrical & Electronic)
URL: https://ieeexplore.ieee.org/document/10971227/
Citation count: 0
Abstract
Semi-supervised semantic segmentation has witnessed remarkable advancements in recent years. However, existing algorithms are based on convolutional neural networks, and directly applying them to Vision Transformers poses certain limitations due to conceptual disparities. To this end, we propose TokenSwap, a data augmentation technique designed explicitly for semi-supervised semantic segmentation with Vision Transformers. TokenSwap aligns well with the global attention mechanism by mixing images at the token level, enhancing the learning of contextual information among image patches and the utilization of unlabeled data. We further incorporate image augmentation and feature augmentation to promote augmentation diversity. Moreover, to strengthen consistency regularization, we propose a dual-branch framework in which each branch applies image and feature augmentation to the input image. We conduct extensive experiments across multiple benchmark datasets, including Pascal VOC 2012, Cityscapes, and COCO. Results suggest that the proposed method outperforms state-of-the-art algorithms with notable accuracy improvements, especially under limited fine annotations.
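The abstract describes mixing images at the token level, i.e., exchanging patch tokens between samples so a Vision Transformer sees mixed contexts. The paper's exact TokenSwap procedure is not given here; the sketch below only illustrates the general idea of swapping a random subset of patch tokens between two images. All function names and the swap ratio are assumptions, not the authors' implementation.

```python
# Illustrative sketch of token-level mixing between two images for a ViT.
# This is NOT the paper's TokenSwap implementation; names and the swap
# ratio are assumptions used to show the general idea.
import numpy as np

def patchify(img, patch=16):
    """Split an HxWxC image into a sequence of flattened patch tokens."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    tokens = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    # Reorder to (num_patches, patch*patch*channels), row-major over the grid.
    return tokens.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def token_swap(tokens_a, tokens_b, ratio=0.5, rng=None):
    """Exchange a random subset of patch tokens between two token sequences.

    The same index set would also be used to mix the corresponding
    (pseudo-)label patches so supervision stays aligned with the tokens.
    """
    rng = rng or np.random.default_rng(0)
    n = tokens_a.shape[0]
    idx = rng.choice(n, size=int(n * ratio), replace=False)
    mixed_a, mixed_b = tokens_a.copy(), tokens_b.copy()
    mixed_a[idx], mixed_b[idx] = tokens_b[idx], tokens_a[idx]
    return mixed_a, mixed_b, idx

# Toy example: a black and a white 224x224 image, 16x16 patches -> 196 tokens.
img_a = np.zeros((224, 224, 3), dtype=np.float32)
img_b = np.ones((224, 224, 3), dtype=np.float32)
ta, tb = patchify(img_a), patchify(img_b)
ma, mb, idx = token_swap(ta, tb, ratio=0.5)
print(ma.shape, len(idx))  # (196, 768) 98
```

Because the mix happens on the patch sequence rather than on pixel regions, every swapped token still attends globally to the surrounding (unswapped) tokens, which is why this style of augmentation fits the transformer's attention mechanism better than CNN-oriented region mixing.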
Journal introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language, and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP, and ICIP, as well as at several workshops organized by the Signal Processing Society.