用信使跟踪双流变压器

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-03-29 DOI:10.1016/j.imavis.2025.105510

Miaobo Qiu , Wenyang Luo , Tongfei Liu , Yanqin Jiang , Jiaming Yan , Wenjuan Li , Jin Gao , Weiming Hu , Stephen Maybank

{"title":"用信使跟踪双流变压器","authors":"Miaobo Qiu , Wenyang Luo , Tongfei Liu , Yanqin Jiang , Jiaming Yan , Wenjuan Li , Jin Gao , Weiming Hu , Stephen Maybank","doi":"10.1016/j.imavis.2025.105510","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, one-stream trackers gradually surpass two-stream trackers and become popular due to their higher accuracy. However, they suffer from a substantial amount of computational redundancy and an increased inference latency. This paper combines the speed advantage of two-stream trackers with the accuracy advantage of one-stream trackers, and proposes a new two-stream Transformer tracker called MesTrack. The core designs of MesTrack lie in the messenger tokens and the message integration module. The messenger tokens obtain the target-specific information during the feature extraction stage of the template branch, while the message integration module integrates the target-specific information from the template branch into the search branch. To further improve accuracy, this paper proposes an adaptive label smoothing knowledge distillation training scheme. This scheme uses the weighted sum of the teacher model’s prediction and the ground truth as supervisory information to guide the training of the student model. The weighting coefficients, which are predicted by the student model, are used to maintain the useful complementary information from the teacher model while simultaneously correcting its erroneous predictions. Evaluation on multiple popular tracking datasets show that MesTrack achieves competitive results. On the LaSOT dataset, the MesTrack-B-384 version achieves a SUC (success rate) score of 73.8%, reaching the SOTA (state of the art) performance, at an inference speed of 69.2 FPS (frames per second). When deployed with TensorRT, the speed can be further improved to 122.6 FPS.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105510"},"PeriodicalIF":4.2000,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Two-stream transformer tracking with messengers\",\"authors\":\"Miaobo Qiu , Wenyang Luo , Tongfei Liu , Yanqin Jiang , Jiaming Yan , Wenjuan Li , Jin Gao , Weiming Hu , Stephen Maybank\",\"doi\":\"10.1016/j.imavis.2025.105510\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Recently, one-stream trackers gradually surpass two-stream trackers and become popular due to their higher accuracy. However, they suffer from a substantial amount of computational redundancy and an increased inference latency. This paper combines the speed advantage of two-stream trackers with the accuracy advantage of one-stream trackers, and proposes a new two-stream Transformer tracker called MesTrack. The core designs of MesTrack lie in the messenger tokens and the message integration module. The messenger tokens obtain the target-specific information during the feature extraction stage of the template branch, while the message integration module integrates the target-specific information from the template branch into the search branch. To further improve accuracy, this paper proposes an adaptive label smoothing knowledge distillation training scheme. This scheme uses the weighted sum of the teacher model’s prediction and the ground truth as supervisory information to guide the training of the student model. The weighting coefficients, which are predicted by the student model, are used to maintain the useful complementary information from the teacher model while simultaneously correcting its erroneous predictions. Evaluation on multiple popular tracking datasets show that MesTrack achieves competitive results. On the LaSOT dataset, the MesTrack-B-384 version achieves a SUC (success rate) score of 73.8%, reaching the SOTA (state of the art) performance, at an inference speed of 69.2 FPS (frames per second). When deployed with TensorRT, the speed can be further improved to 122.6 FPS.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"158 \",\"pages\":\"Article 105510\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-03-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625000988\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000988","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

近年来，单流跟踪器逐渐超越双流跟踪器，以其更高的精度而受到欢迎。然而，它们受到大量计算冗余和增加的推理延迟的影响。本文结合双流跟踪器的速度优势和单流跟踪器的精度优势，提出了一种新的双流变压器跟踪器MesTrack。MesTrack的核心设计在于消息令牌和消息集成模块。信使令牌在模板分支的特征提取阶段获取目标特定信息，消息集成模块将模板分支的目标特定信息集成到搜索分支中。为了进一步提高准确率，本文提出了一种自适应标签平滑知识蒸馏训练方案。该方案使用教师模型的预测值与真实值的加权和作为监督信息来指导学生模型的训练。由学生模型预测的权重系数用于维护来自教师模型的有用补充信息，同时纠正其错误预测。对多个流行的跟踪数据集的评估表明，MesTrack取得了具有竞争力的结果。在LaSOT数据集上，MesTrack-B-384版本达到了73.8%的SUC（成功率）分数，达到了SOTA（最先进的）性能，推理速度为69.2 FPS（每秒帧数）。当使用TensorRT部署时，速度可以进一步提高到122.6 FPS。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Two-stream transformer tracking with messengers

Recently, one-stream trackers gradually surpass two-stream trackers and become popular due to their higher accuracy. However, they suffer from a substantial amount of computational redundancy and an increased inference latency. This paper combines the speed advantage of two-stream trackers with the accuracy advantage of one-stream trackers, and proposes a new two-stream Transformer tracker called MesTrack. The core designs of MesTrack lie in the messenger tokens and the message integration module. The messenger tokens obtain the target-specific information during the feature extraction stage of the template branch, while the message integration module integrates the target-specific information from the template branch into the search branch. To further improve accuracy, this paper proposes an adaptive label smoothing knowledge distillation training scheme. This scheme uses the weighted sum of the teacher model’s prediction and the ground truth as supervisory information to guide the training of the student model. The weighting coefficients, which are predicted by the student model, are used to maintain the useful complementary information from the teacher model while simultaneously correcting its erroneous predictions. Evaluation on multiple popular tracking datasets show that MesTrack achieves competitive results. On the LaSOT dataset, the MesTrack-B-384 version achieves a SUC (success rate) score of 73.8%, reaching the SOTA (state of the art) performance, at an inference speed of 69.2 FPS (frames per second). When deployed with TensorRT, the speed can be further improved to 122.6 FPS.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.