Transformer-Based RGB-T Tracking With Channel and Spatial Feature Fusion

IF 4.3 · CAS Tier 2 (Comprehensive Journals) · JCR Q1 ENGINEERING, ELECTRICAL & ELECTRONIC
Yunfeng Li;Bo Wang;Ye Li
DOI: 10.1109/JSEN.2025.3579339
Journal: IEEE Sensors Journal, vol. 25, no. 15, pp. 28891–28904
Published: 2025-06-18 (Journal Article)
URL: https://ieeexplore.ieee.org/document/11040139/
Citations: 0

Abstract

Transformer-Based RGB-T Tracking With Channel and Spatial Feature Fusion
The main problem in RGB-thermal (RGB-T) tracking is the correct and optimal merging of the cross-modal features of visible and thermal images. Some previous methods either do not fully exploit the potential of RGB and TIR information for channel and spatial feature fusion or lack a direct interaction between the template and the search area, which limits the model's ability to fully utilize the original semantic information of both modalities. To address these limitations, we investigate how to achieve a direct fusion of cross-modal channel and spatial features in RGB-T tracking and propose the channel and spatial transformer network (CSTNet). It uses the vision transformer (ViT) as the backbone and adds a joint spatial and channel fusion module (JSCFM) and a spatial fusion module (SFM), integrated between the transformer blocks, to facilitate cross-modal feature interaction. The JSCFM module achieves joint modeling of channel and multilevel spatial features. The SFM module includes a cross-attention-like architecture for cross modeling and joint learning of RGB and TIR features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance. To enhance practicality, we retrain the model without the JSCFM and SFM modules using CSTNet as the pretraining weight and propose CSTNet-small, which achieves a 50% speedup with an average decrease of 1%–2% in SR and PR performance. CSTNet and CSTNet-small achieve real-time speeds of 21 and 33 frames/s on the Nvidia Jetson Xavier, meeting actual deployment requirements. Code is available at https://github.com/LiYunfengLYF/CSTNet
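The "cross-attention-like" interaction that the abstract attributes to the SFM can be illustrated with a minimal NumPy sketch, in which tokens from each modality query the tokens of the other and the enhanced features are fused additively. This is an assumption-laden illustration of the general technique, not the authors' implementation: the learned projection matrices, normalization, and placement between ViT blocks are all omitted, and every name below is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    # each query token attends over the other modality's tokens
    scores = queries @ context.T / np.sqrt(d_k)   # (Nq, Nc) similarity
    weights = softmax(scores, axis=-1)            # rows sum to 1
    return weights @ context                      # context aggregated per query

# toy tokens: 4 RGB tokens and 4 TIR tokens, embedding dim 8
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
tir = rng.standard_normal((4, 8))

# symmetric cross-modal enhancement with residual connections:
# RGB queries TIR, TIR queries RGB
rgb_enhanced = rgb + cross_attention(rgb, tir, d_k=8)
tir_enhanced = tir + cross_attention(tir, rgb, d_k=8)

# simple additive fusion of the enhanced modality features
fused = rgb_enhanced + tir_enhanced
```

In a real tracker the queries, keys, and values would pass through learned linear projections (and typically multiple heads), but the core idea — letting each modality's tokens aggregate information from the other modality before fusion — is the same.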
Source Journal: IEEE Sensors Journal (Engineering & Technology — Engineering: Electrical & Electronic)
CiteScore: 7.70
Self-citation rate: 14.00%
Articles per year: 2058
Review time: 5.2 months
Journal scope: The fields of interest of the IEEE Sensors Journal are the theory, design, fabrication, manufacturing, and applications of devices for sensing and transducing physical, chemical, and biological phenomena, with emphasis on the electronics and physics aspects of sensors and integrated sensor-actuators. IEEE Sensors Journal deals with the following: sensor phenomenology, modeling, and evaluation; sensor materials, processing, and fabrication; chemical and gas sensors; microfluidics and biosensors; optical sensors; physical sensors (temperature, mechanical, magnetic, and others); acoustic and ultrasonic sensors; sensor packaging; sensor networks; sensor applications; sensor systems (signals, processing, and interfaces); actuators and sensor power systems; sensor signal processing for high precision and stability (amplification, filtering, linearization, modulation/demodulation) and under harsh conditions (EMC, radiation, humidity, temperature), and energy consumption/harvesting; sensor data processing (soft computing with sensor data, e.g., pattern recognition, machine learning, evolutionary computation; sensor data fusion; processing of wave, e.g., electromagnetic and acoustic, and non-wave, e.g., chemical, gravity, particle, thermal, radiative and non-radiative sensor data; detection, estimation, and classification based on sensor data); sensors in industrial practice.