Correlation-Embedded Transformer Tracking: A Single-Branch Framework

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-08-22. DOI: 10.1109/TPAMI.2024.3448254

Fei Xie;Wankou Yang;Chunyu Wang;Lei Chu;Yue Cao;Chao Ma;Wenjun Zeng

Vol. 46, No. 12, pp. 10681-10696. https://ieeexplore.ieee.org/document/10643566/

Abstract

Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, which prevents these trackers from being robust and discriminative at the same time. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be used directly to predict target locations without additional correlation steps. Thus, we reformulate two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling scheme is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, integrated temporal modeling, and dedicated prediction heads. As a result, SuperSBT outperforms the SBT baseline by 4.7%, 3.0%, and 4.5% AUC on LaSOT, TrackingNet, and GOT-10K, respectively. Notably, SuperSBT raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks.
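The contrast the abstract draws between Siamese-like and single-branch feature extraction can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes plain scaled dot-product attention with identity projections and a residual connection, and it omits learned weights, normalization, positional information, and prediction heads. All function names (`attention`, `single_branch_layer`, `single_branch_features`, `siamese_features`) are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys_values):
    """Scaled dot-product attention with identity projections (an assumption)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(d)])
    return out


def residual_add(tokens, update):
    """Add the attention output back onto the input tokens."""
    return [[t + u for t, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]


def single_branch_layer(template, search):
    """One layer in which template and search tokens attend to each other."""
    tokens = template + search              # joint token sequence
    updated = residual_add(tokens, attention(tokens, tokens))
    n = len(template)
    return updated[:n], updated[n:]


def single_branch_features(template, search, num_layers=3):
    """Cross-image matching is repeated in every layer of the feature network."""
    for _ in range(num_layers):
        template, search = single_branch_layer(template, search)
    return template, search


def siamese_features(tokens, num_layers=3):
    """Siamese-style extraction: each image is processed on its own."""
    for _ in range(num_layers):
        tokens = residual_add(tokens, attention(tokens, tokens))
    return tokens
```

In this sketch the search features returned by `single_branch_features` change whenever the template changes, because matching happens inside every layer. The output of `siamese_features` depends only on the image it is given, so a separate correlation step would still be needed afterwards.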

The authors' code is available at https://github.com/phiphiphi31/SBT.
