Two-stream transformer tracking with messengers

Impact factor 4.2; Tier 3, Computer Science; JCR Q2, Computer Science, Artificial Intelligence
Miaobo Qiu, Wenyang Luo, Tongfei Liu, Yanqin Jiang, Jiaming Yan, Wenjuan Li, Jin Gao, Weiming Hu, Stephen Maybank
Image and Vision Computing, volume 158, Article 105510. Published 2025-03-29. DOI: 10.1016/j.imavis.2025.105510
https://www.sciencedirect.com/science/article/pii/S0262885625000988
Citations: 0

Abstract

Recently, one-stream trackers have gradually surpassed two-stream trackers and become popular owing to their higher accuracy. However, they suffer from substantial computational redundancy and increased inference latency. This paper combines the speed advantage of two-stream trackers with the accuracy advantage of one-stream trackers, and proposes a new two-stream Transformer tracker called MesTrack. The core designs of MesTrack are the messenger tokens and the message integration module. The messenger tokens obtain target-specific information during the feature extraction stage of the template branch, while the message integration module integrates that target-specific information from the template branch into the search branch. To further improve accuracy, this paper proposes an adaptive label smoothing knowledge distillation training scheme. This scheme uses the weighted sum of the teacher model's prediction and the ground truth as supervisory information to guide the training of the student model. The weighting coefficients, which are predicted by the student model, preserve the useful complementary information from the teacher model while correcting its erroneous predictions. Evaluations on multiple popular tracking datasets show that MesTrack achieves competitive results. On the LaSOT dataset, the MesTrack-B-384 version achieves a SUC (success rate) score of 73.8%, reaching state-of-the-art (SOTA) performance, at an inference speed of 69.2 FPS (frames per second). When deployed with TensorRT, the speed can be further improved to 122.6 FPS.
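
The supervisory signal described in the abstract, a weighted sum of the teacher's prediction and the ground truth, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names `adaptive_soft_target` and `cross_entropy` are hypothetical, the weight `alpha` is taken to be a single scalar in [0, 1] per sample (in the paper it is predicted by the student model; here it is simply passed in), and the blend is assumed to be convex.

```python
import math


def adaptive_soft_target(teacher_prob, ground_truth, alpha):
    """Blend a teacher prediction with the ground truth.

    Assumption (not from the paper): alpha is one scalar in [0, 1];
    alpha weights the teacher and (1 - alpha) weights the ground truth.
    """
    if len(teacher_prob) != len(ground_truth):
        raise ValueError("teacher_prob and ground_truth must have equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * g
            for t, g in zip(teacher_prob, ground_truth)]


def cross_entropy(student_prob, target, eps=1e-12):
    """Cross-entropy of the student distribution against a soft target."""
    return -sum(q * math.log(p + eps) for p, q in zip(student_prob, target))


# Hypothetical three-way distribution: the teacher is fairly confident
# in the correct class (index 1) but spreads some mass elsewhere.
teacher = [0.1, 0.7, 0.2]
truth = [0.0, 1.0, 0.0]

# alpha = 0.5 keeps half of the teacher's "dark knowledge".
blended = adaptive_soft_target(teacher, truth, alpha=0.5)

# alpha = 0.0 discards the teacher entirely, which is how a wrong
# teacher prediction would be corrected under this sketch.
corrected = adaptive_soft_target(teacher, truth, alpha=0.0)

student = [0.2, 0.6, 0.2]
loss = cross_entropy(student, blended)
```

Under these assumptions, a small `alpha` pulls the target toward the ground truth when the teacher is unreliable, while a large `alpha` retains more of the teacher's distribution over non-target classes.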
Source journal: Image and Vision Computing
Category: Engineering Technology - Engineering: Electrical and Electronic
CiteScore: 8.50
Self-citation rate: 8.50%
Articles published: 143
Review time: 7.8 months
Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.