{"title":"基于通道和空间特征融合的变压器RGB-T跟踪","authors":"Yunfeng Li;Bo Wang;Ye Li","doi":"10.1109/JSEN.2025.3579339","DOIUrl":null,"url":null,"abstract":"The main problem in RGB-thermal (RGB-T) tracking is the correct and optimal merging of the cross-modal features of visible and thermal images. Some previous methods either do not fully exploit the potential of RGB and TIR information for channel and spatial feature fusion or lack a direct interaction between the template and the search area, which limits the model’s ability to fully utilize the original semantic information of both modalities. To address these limitations, we investigate how to achieve a direct fusion of cross-modal channels and spatial features in RGB-T tracking and propose channel and spatial transformer network (CSTNet). It uses the vision transformer (ViT) as the backbone and adds a joint spatial and channel fusion module (JSCFM) and spatial fusion module (SFM) integrated between the transformer blocks to facilitate cross-modal feature interaction. The JSCFM module achieves joint modeling of channel and multilevel spatial features. The SFM module includes a cross-attention-like architecture for cross modeling and joint learning of RGB and TIR features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance. To enhance practicality, we retrain the model without JSCFM and SFM modules and use CSNet as the pretraining weight, and propose CSTNet-small, which achieves 50% speedup with an average decrease of 1%–2% in SR and PR performance. CSTNet and CSTNet-small achieve real-time speeds of 21 and 33 frames/s on the Nvidia Jetson Xavier, meeting actual deployment requirements. Code is available at <uri>https://github.com/LiYunfengLYF/CSTNet</uri>","PeriodicalId":447,"journal":{"name":"IEEE Sensors Journal","volume":"25 15","pages":"28891-28904"},"PeriodicalIF":4.3000,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Transformer-Based RGB-T Tracking With Channel and Spatial Feature Fusion\",\"authors\":\"Yunfeng Li;Bo Wang;Ye Li\",\"doi\":\"10.1109/JSEN.2025.3579339\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The main problem in RGB-thermal (RGB-T) tracking is the correct and optimal merging of the cross-modal features of visible and thermal images. Some previous methods either do not fully exploit the potential of RGB and TIR information for channel and spatial feature fusion or lack a direct interaction between the template and the search area, which limits the model’s ability to fully utilize the original semantic information of both modalities. To address these limitations, we investigate how to achieve a direct fusion of cross-modal channels and spatial features in RGB-T tracking and propose channel and spatial transformer network (CSTNet). It uses the vision transformer (ViT) as the backbone and adds a joint spatial and channel fusion module (JSCFM) and spatial fusion module (SFM) integrated between the transformer blocks to facilitate cross-modal feature interaction. The JSCFM module achieves joint modeling of channel and multilevel spatial features. The SFM module includes a cross-attention-like architecture for cross modeling and joint learning of RGB and TIR features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance. 
To enhance practicality, we retrain the model without JSCFM and SFM modules and use CSNet as the pretraining weight, and propose CSTNet-small, which achieves 50% speedup with an average decrease of 1%–2% in SR and PR performance. CSTNet and CSTNet-small achieve real-time speeds of 21 and 33 frames/s on the Nvidia Jetson Xavier, meeting actual deployment requirements. Code is available at <uri>https://github.com/LiYunfengLYF/CSTNet</uri>\",\"PeriodicalId\":447,\"journal\":{\"name\":\"IEEE Sensors Journal\",\"volume\":\"25 15\",\"pages\":\"28891-28904\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2025-06-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Sensors Journal\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11040139/\",\"RegionNum\":2,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Sensors Journal","FirstCategoryId":"103","ListUrlMain":"https://ieeexplore.ieee.org/document/11040139/","RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Transformer-Based RGB-T Tracking With Channel and Spatial Feature Fusion
Abstract: A central problem in RGB-thermal (RGB-T) tracking is how to correctly and optimally merge the cross-modal features of visible and thermal infrared (TIR) images. Some previous methods either do not fully exploit the potential of RGB and TIR information for channel and spatial feature fusion or lack direct interaction between the template and the search area, which limits the model's ability to fully utilize the original semantic information of both modalities. To address these limitations, we investigate how to achieve direct fusion of cross-modal channel and spatial features in RGB-T tracking and propose the channel and spatial transformer network (CSTNet). CSTNet uses a vision transformer (ViT) as its backbone and inserts a joint spatial and channel fusion module (JSCFM) and a spatial fusion module (SFM) between the transformer blocks to facilitate cross-modal feature interaction. The JSCFM jointly models channel and multilevel spatial features. The SFM includes a cross-attention-like architecture for cross modeling and joint learning of RGB and TIR features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance. To enhance practicality, we retrain the model without the JSCFM and SFM modules, using CSTNet as the pretraining weight, and propose CSTNet-small, which achieves a 50% speedup at an average cost of 1%–2% in success rate (SR) and precision rate (PR). CSTNet and CSTNet-small run in real time at 21 and 33 frames/s, respectively, on the Nvidia Jetson Xavier, meeting practical deployment requirements. Code is available at https://github.com/LiYunfengLYF/CSTNet
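To make the fusion idea concrete, below is a minimal PyTorch sketch of a bidirectional cross-attention fusion block in the spirit of the SFM described in the abstract. It is an illustrative assumption, not the authors' implementation (see the linked repository for that): the class name CrossModalFusion, the token and embedding dimensions, and the residual wiring are all hypothetical.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch of a cross-attention-like fusion block.

    RGB tokens attend to TIR tokens and vice versa; the exchanged
    features are added back to each stream. Dimensions and residual
    wiring are assumptions, not the paper's exact design.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_tir = nn.LayerNorm(dim)
        # Two directions of cross-attention: RGB->TIR and TIR->RGB.
        self.attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_tir = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor):
        # rgb, tir: (batch, tokens, dim) token sequences covering the
        # template and search regions of each modality.
        q_rgb, q_tir = self.norm_rgb(rgb), self.norm_tir(tir)
        # RGB queries attend to TIR keys/values, and vice versa.
        rgb_out, _ = self.attn_rgb(q_rgb, q_tir, q_tir)
        tir_out, _ = self.attn_tir(q_tir, q_rgb, q_rgb)
        # Residuals preserve each modality's original semantics.
        return rgb + rgb_out, tir + tir_out

if __name__ == "__main__":
    # Usage: fuse 256 tokens of 768-dim features from each modality.
    fusion = CrossModalFusion(dim=768, num_heads=8)
    rgb = torch.randn(2, 256, 768)
    tir = torch.randn(2, 256, 768)
    rgb_f, tir_f = fusion(rgb, tir)
    print(rgb_f.shape, tir_f.shape)  # torch.Size([2, 256, 768]) twice

A block like this can be dropped between ViT stages, which is consistent with the abstract's description of modules integrated between transformer blocks; the actual JSCFM additionally models channel attention, which this sketch omits.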
Journal Introduction:
The fields of interest of the IEEE Sensors Journal are the theory, design, fabrication, manufacturing, and applications of devices for sensing and transducing physical, chemical, and biological phenomena, with emphasis on the electronics and physics aspects of sensors and integrated sensor-actuators. The IEEE Sensors Journal covers the following topics:
- Sensor Phenomenology, Modelling, and Evaluation
- Sensor Materials, Processing, and Fabrication
- Chemical and Gas Sensors
- Microfluidics and Biosensors
- Optical Sensors
- Physical Sensors: Temperature, Mechanical, Magnetic, and Others
- Acoustic and Ultrasonic Sensors
- Sensor Packaging
- Sensor Networks
- Sensor Applications
- Sensor Systems: Signals, Processing, and Interfaces
- Actuators and Sensor Power Systems
- Sensor Signal Processing: for high precision and stability (amplification, filtering, linearization, modulation/demodulation) and under harsh conditions (EMC, radiation, humidity, temperature); energy consumption/harvesting
- Sensor Data Processing: soft computing with sensor data (e.g., pattern recognition, machine learning, evolutionary computation); sensor data fusion; processing of wave (e.g., electromagnetic, acoustic) and non-wave (e.g., chemical, gravity, particle, thermal, radiative and non-radiative) sensor data; detection, estimation, and classification based on sensor data
- Sensors in Industrial Practice