A transformer based visual tracker with restricted token interaction and knowledge distillation
Authors: Nian Liu, Yi Zhang
Journal: Knowledge-Based Systems, Volume 307, Article 112736 (Impact Factor 7.2; JCR Q1, Computer Science, Artificial Intelligence)
Published: 2024-11-20
DOI: 10.1016/j.knosys.2024.112736
URL: https://www.sciencedirect.com/science/article/pii/S0950705124013704
Recently, one-stream pipelines, in which the template and search images interact at early stages, have made significant progress in visual object tracking (VOT). However, they have a potential problem: they treat the object and the background (or other irrelevant parts) equally, which weakens the discriminability of the extracted features. To remedy this issue, this paper proposes a restricted token interaction module based on an asymmetric attention mechanism, which divides the search image into a valuable part and a remaining part. Only the valuable part is selected for cross-attention with the template, so as to better distinguish the object from the background, which ultimately improves localization accuracy and robustness. In addition, to avoid heavy computational overhead, we use logit distillation and localization distillation to optimize the outputs of the classification and regression heads, respectively. We also separate the distillation regions and apply different knowledge distillation methods in different regions, to determine which regions are most beneficial for classification or localization learning. Extensive experiments on mainstream datasets show that our tracker (dubbed RIDTrack) achieves appealing results while meeting the real-time requirement.
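The abstract does not give implementation details, so the following is only a minimal sketch of one possible reading of the two ideas, not the authors' method. It assumes that tokens are ordered as template tokens followed by search tokens, that a boolean flag per search token marks the valuable part (how that flag is computed is not stated in the abstract), and that the asymmetry means template queries never attend to search tokens. The function names build_restricted_mask, masked_softmax, and logit_distillation_loss are hypothetical, and the distillation loss is the standard temperature-softened KL divergence rather than a formula taken from the paper.

```python
import math


def build_restricted_mask(num_template, valuable):
    """Return allowed[q][k] for tokens ordered [template..., search...].

    `valuable` holds one boolean per search token. This input is an
    assumption: the abstract does not say how the valuable part is chosen.
    """
    n = num_template + len(valuable)
    is_template = [i < num_template for i in range(n)]
    is_valuable = [False] * num_template + list(valuable)
    allowed = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if is_template[q]:
                # Assumed asymmetry: template queries see only the template.
                allowed[q][k] = is_template[k]
            elif is_valuable[q]:
                # Valuable search queries cross-attend with the template
                # and also see the whole search image.
                allowed[q][k] = True
            else:
                # Remaining search queries stay within the search image.
                allowed[q][k] = not is_template[k]
    return allowed


def masked_softmax(scores, allowed_row):
    """Softmax over one row of attention scores, skipping disallowed keys."""
    kept = [s for s, a in zip(scores, allowed_row) if a]
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if a else 0.0
            for s, a in zip(scores, allowed_row)]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    def soften(logits):
        top = max(logits)
        exps = [math.exp((x - top) / temperature) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    p_teacher = soften(teacher_logits)
    p_student = soften(student_logits)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    # The T^2 factor is the usual rescaling used with softened targets.
    return temperature ** 2 * kl


if __name__ == "__main__":
    mask = build_restricted_mask(2, [True, False, True])
    for row in mask:
        print("".join("1" if a else "0" for a in row))
    print(masked_softmax([0.0] * 5, mask[3]))
    print(logit_distillation_loss([0.5, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

In a real tracker the same boolean mask would typically be passed to a batched attention operator rather than applied row by row; the pure-Python form is used here only to keep the sketch dependency-free.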
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.