DPT-tracker: Dual pooling transformer for efficient visual tracking

Impact factor 8.4 · CAS Tier 2 (Computer Science) · JCR Q1, Computer Science, Artificial Intelligence
Yang Fang, Bailian Xie, Uswah Khairuddin, Zijian Min, Bingbing Jiang, Weisheng Li
{"title":"DPT-tracker: Dual pooling transformer for efficient visual tracking","authors":"Yang Fang,&nbsp;Bailian Xie,&nbsp;Uswah Khairuddin,&nbsp;Zijian Min,&nbsp;Bingbing Jiang,&nbsp;Weisheng Li","doi":"10.1049/cit2.12296","DOIUrl":null,"url":null,"abstract":"<p>Transformer tracking always takes paired template and search images as encoder input and conduct feature extraction and target-search feature correlation by self and/or cross attention operations, thus the model complexity will grow quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling transformer tracking framework, dubbed as DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT) and a multiscale aggregation pooling Transformer (MAPT). SAM is designed to gracefully aggregates temporal dynamics and spatial appearance information of multi-frame templates along space-time dimensions. MCPT aims to capture multi-scale pooled and correlated contextual features, which is followed by MAPT that aggregates multi-scale features into a unified feature representation for tracking prediction. DPT tracker achieves AUC score of 69.5 on LaSOT and precision score of 82.8 on TrackingNet while maintaining a shorter sequence length of attention tokens, fewer parameters and FLOPs compared to existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that DPT tracker yields a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"9 4","pages":"948-959"},"PeriodicalIF":8.4000,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12296","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CAAI Transactions on Intelligence Technology","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/cit2.12296","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Transformer tracking always takes paired template and search images as encoder input and conducts feature extraction and target-search feature correlation by self- and/or cross-attention operations, so model complexity grows quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT), and a multi-scale aggregation pooling Transformer (MAPT). SAM is designed to gracefully aggregate the temporal dynamics and spatial appearance information of multi-frame templates along space-time dimensions. MCPT aims to capture multi-scale pooled and correlated contextual features; it is followed by MAPT, which aggregates the multi-scale features into a unified feature representation for tracking prediction. The DPT tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while maintaining a shorter attention-token sequence length and fewer parameters and FLOPs than existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker yields a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
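The efficiency argument rests on attention cost scaling with the product of query and key token counts, so pooling the key/value sequence shrinks the score matrix. The paper's exact MCPT/MAPT block designs are not reproduced here; the sketch below is a minimal, hypothetical illustration of pooled attention in PyTorch, where the names `PooledAttention` and `pool_stride` are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class PooledAttention(nn.Module):
    """Single-head attention with average-pooled keys/values.

    Hypothetical sketch only: pooling the token axis by stride s
    shrinks the attention score matrix from N x N to N x (N / s).
    """

    def __init__(self, dim: int, pool_stride: int = 2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # AvgPool1d pools over the last axis, so tokens are moved there.
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. concatenated template + search tokens
        q = self.q(x)                                          # (B, N, D)
        pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, N/s, D)
        k, v = self.k(pooled), self.v(pooled)
        scores = (q @ k.transpose(-2, -1)) * self.scale        # (B, N, N/s)
        return scores.softmax(dim=-1) @ v                      # (B, N, D)


# With 64 tokens and stride 4, the score matrix is 64 x 16 instead of 64 x 64.
x = torch.randn(1, 64, 256)
y = PooledAttention(dim=256, pool_stride=4)(x)
print(y.shape)  # torch.Size([1, 64, 256])
```

A stride-s pool cuts score-matrix FLOPs by roughly a factor of s, which is consistent with the abstract's claim of shorter attention-token sequences and fewer FLOPs than full-attention trackers.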


Source journal

CAAI Transactions on Intelligence Technology (Computer Science, Artificial Intelligence)
CiteScore: 11.00
Self-citation rate: 3.90%
Articles per year: 134
Review time: 35 weeks
Journal description: CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. It is a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), making research openly accessible to read and share worldwide.