MMCATrack: Multi-Modal Channel Attention Tracker

Zhiqiang Zhao, Daitu Wen, Yuanhang Gu, Xiaoli Luo, Tao Ma, Xu Ma, Bin Wu

IET Computer Vision · DOI: 10.1049/cvi2.70060 · Published 2026-03-07
https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cvi2.70060
Abstract
Most existing Transformer-based visual object tracking methods rely exclusively on the feature map from the last encoder layer for object prediction, thereby overlooking the rich information contained in shallow and intermediate feature maps. This limitation reduces the representational capacity of the model. Moreover, current multi-modal tracking frameworks typically construct multi-modal features through simple concatenation, which fails to account for the differing contributions of individual modalities to the final prediction task. As a result, these approaches are unable to adequately express key features within the multi-modal representation. To address these issues, this paper proposes a multi-modal channel attention tracking algorithm that incorporates a multi-modal channel attention block to enhance the representation of key features within the multi-modal features. Specifically, the multi-modal channel attention block first aggregates multi-modal information from the multi-layer feature maps of the encoder through cross-layer cascading, and then applies a channel attention mechanism to dynamically calibrate the channel weights of the generated multi-modal features, thereby enhancing the representation of key features. In addition, the paper proposes a new regression loss function to improve localisation accuracy. Finally, extensive experiments on five benchmarks, including GOT-10k, TrackingNet, TNL2K, VisEvent and RGBT234, verify the effectiveness of the proposed method.
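The abstract does not specify the block's exact architecture, but the pipeline it describes (cross-layer cascading of multi-layer, multi-modal encoder features, followed by dynamic channel re-weighting) resembles a squeeze-and-excitation-style design. The following is a minimal PyTorch sketch of such a block under that assumption; the class name MultiModalChannelAttention, the parameters channels_per_map, num_maps and reduction, and the concatenate-then-reweight structure are all illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch: multi-layer encoder features from each modality are
# cascaded (concatenated) along the channel axis, then re-weighted by a
# squeeze-and-excitation-style channel attention. Illustrative only; this
# is not the paper's code.
import torch
import torch.nn as nn

class MultiModalChannelAttention(nn.Module):
    def __init__(self, channels_per_map: int, num_maps: int, reduction: int = 16):
        super().__init__()
        total = channels_per_map * num_maps  # channels after cross-layer cascading
        self.pool = nn.AdaptiveAvgPool2d(1)  # global "squeeze" to (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(total, total // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(total // reduction, total),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, feature_maps: list[torch.Tensor]) -> torch.Tensor:
        # feature_maps: per-layer, per-modality maps, each of shape (B, C, H, W)
        x = torch.cat(feature_maps, dim=1)    # cross-layer / cross-modality cascade
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c))  # dynamically computed channel weights
        return x * w.view(b, c, 1, 1)         # calibrate channel responses

# Usage: e.g. two modalities (RGB + event/thermal) x three encoder layers
maps = [torch.randn(2, 256, 16, 16) for _ in range(6)]
out = MultiModalChannelAttention(channels_per_map=256, num_maps=6)(maps)
print(out.shape)  # torch.Size([2, 1536, 16, 16])
```

The key design point illustrated here is that the sigmoid gate lets the network suppress channels from a less informative modality or layer instead of weighting all concatenated channels equally, which is the limitation of plain concatenation that the abstract criticises.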
Journal introduction:
IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest-quality research work that is relevant and topical to the field, while also welcoming works that introduce new horizons and set the agenda for future avenues of research in computer vision.
IET Computer Vision welcomes submissions on the following topics:
Biologically and perceptually motivated approaches to low-level vision (feature detection, etc.)
Perceptual grouping and organisation
Representation, analysis and matching of 2D and 3D shape
Shape-from-X
Object recognition
Image understanding
Learning with visual inputs
Motion analysis and object tracking
Multiview scene analysis
Cognitive approaches in low, mid and high level vision
Control in visual systems
Colour, reflectance and light
Statistical and probabilistic models
Face and gesture
Surveillance
Biometrics and security
Robotics
Vehicle guidance
Automatic model acquisition
Medical image analysis and understanding
Aerial scene analysis and remote sensing
Deep learning models in computer vision
Both methodological and application-orientated papers are welcome.
Submitted manuscripts are expected to include a detailed and analytical review of the literature and the state of the art, an exposition of the original proposed research and its methodology, a thorough experimental evaluation, and a comparative evaluation against relevant state-of-the-art methods. Submissions not meeting these minimum requirements may be returned to authors without being sent for review.
Special Issues Current Call for Papers:
Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf
Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf