Two-stream transformer tracking with messengers
Miaobo Qiu, Wenyang Luo, Tongfei Liu, Yanqin Jiang, Jiaming Yan, Wenjuan Li, Jin Gao, Weiming Hu, Stephen Maybank
Image and Vision Computing, vol. 158, Article 105510. Published 2025-03-29. DOI: 10.1016/j.imavis.2025.105510
Impact factor: 4.2 (JCR Q2, Computer Science, Artificial Intelligence)
https://www.sciencedirect.com/science/article/pii/S0262885625000988
One-stream trackers have recently surpassed two-stream trackers in accuracy and have become the more popular design. However, they suffer from substantial computational redundancy and increased inference latency. This paper combines the speed advantage of two-stream trackers with the accuracy advantage of one-stream trackers and proposes a new two-stream Transformer tracker called MesTrack. The core components of MesTrack are the messenger tokens and the message integration module. The messenger tokens collect target-specific information during the feature extraction stage of the template branch, and the message integration module passes that information from the template branch into the search branch. To further improve accuracy, the paper proposes an adaptive label smoothing knowledge distillation training scheme. This scheme uses the weighted sum of the teacher model's prediction and the ground truth as the supervisory signal for training the student model. The weighting coefficients, which are predicted by the student model, preserve the useful complementary information from the teacher model while correcting its erroneous predictions. Evaluations on multiple popular tracking datasets show that MesTrack achieves competitive results. On the LaSOT dataset, the MesTrack-B-384 version achieves a SUC (success rate) score of 73.8%, reaching state-of-the-art (SOTA) performance, at an inference speed of 69.2 FPS (frames per second). When deployed with TensorRT, the speed further improves to 122.6 FPS.
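The supervisory signal described in the abstract, a weighted sum of the teacher's prediction and the ground truth, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which term the coefficient multiplies, which loss is used, or how the student produces the coefficient. The sketch assumes the coefficient is the teacher's share, a soft-target cross-entropy loss, a sigmoid over a hypothetical student output, and made-up scores over four candidate locations. All function names are illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a raw score into (0, 1) so it can serve as a mixing weight."""
    return 1.0 / (1.0 + math.exp(-x))


def blended_target(teacher_probs, ground_truth, weight):
    """Weighted sum of teacher prediction and ground truth.

    Assumption: `weight` is the teacher's share and (1 - weight) is the
    ground truth's share; the abstract does not specify the convention.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * t + (1.0 - weight) * g
            for t, g in zip(teacher_probs, ground_truth)]


def cross_entropy(target, student_probs, eps=1e-12):
    """Soft-target cross-entropy (assumed loss, not confirmed by the source)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, student_probs))


# Toy values over four candidate locations (all numbers are made up).
teacher_probs = softmax([2.0, 0.5, -1.0, 0.0])   # teacher favours index 0 (wrong)
ground_truth = [0.0, 1.0, 0.0, 0.0]              # true location is index 1
student_probs = softmax([0.3, 1.2, -0.5, 0.1])
weight = sigmoid(-0.4)                           # hypothetical student-predicted coefficient

target = blended_target(teacher_probs, ground_truth, weight)
loss = cross_entropy(target, student_probs)
print("blended target:", [round(v, 3) for v in target])
print("loss:", round(loss, 4))
```

Under these assumptions, a weight near 0 falls back to plain ground-truth supervision, which is how a mistaken teacher prediction would be corrected, while a weight near 1 reproduces ordinary distillation from the teacher.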
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.