Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking
Shilei Wang; Zhenhua Wang; Qianqian Sun; Gong Cheng; Jifeng Ning
IEEE Transactions on Image Processing, vol. 33, pp. 5073-5085, 2024
DOI: 10.1109/TIP.2024.3453028 | Published online: 2024-09-09
https://ieeexplore.ieee.org/document/10670064/
Abstract
Recently, one-stream trackers have achieved parallel feature extraction and relation modeling by exploiting Transformer-based architectures, a design that greatly improves tracking performance. However, because one-stream trackers often overlook crucial tracking cues beyond the template, they are prone to producing unsatisfactory results in complex tracking scenarios. To tackle these challenges, we propose a multi-cue single-stream tracker, dubbed MCTrack, which seamlessly integrates template information, the historical trajectory, historical frames, and the search region for synchronized feature extraction and relation modeling. To achieve this, we employ two types of encoders to convert the template, historical frames, search region, and historical trajectory into tokens, which are then fed collectively into a Transformer architecture. To distill temporal and spatial cues, we introduce a novel adaptive update mechanism that incorporates a thresholding component and a local multi-peak component to filter out less accurate and heavily disturbed tracking cues. Empirically, MCTrack achieves leading performance on mainstream benchmark datasets, surpassing the state-of-the-art SeqTrack by 2.0% in terms of the AO metric on GOT-10k. The code is available at https://github.com/wsumel/MCTrack.
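The abstract describes converting the template, historical frames, the search region, and the historical trajectory into tokens that a Transformer then processes jointly. The snippet below is a minimal PyTorch sketch of that general idea only; it is not the authors' implementation. The module names, patch size, embedding dimension, trajectory format (past boxes as x, y, w, h), and the omission of positional embeddings and the adaptive update mechanism are all simplifying assumptions.

# Minimal sketch (not MCTrack itself): image cues are patch-embedded, the
# trajectory is embedded with a small MLP, and all tokens are concatenated and
# encoded jointly so feature extraction and relation modeling happen together.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to tokens."""
    def __init__(self, in_ch=3, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)


class MultiCueEncoder(nn.Module):
    """Jointly encode template, historical frame, search region, and trajectory tokens."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.img_embed = PatchEmbed(dim=dim)
        # Trajectory encoder: maps each past box (x, y, w, h) to one token (an assumption).
        self.traj_embed = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, template, hist_frame, search, trajectory):
        # trajectory: (B, T, 4) past bounding boxes
        tokens = torch.cat([
            self.img_embed(template),
            self.img_embed(hist_frame),
            self.img_embed(search),
            self.traj_embed(trajectory),
        ], dim=1)                               # (B, N_total, dim)
        return self.encoder(tokens)             # joint feature extraction + relation modeling


if __name__ == "__main__":
    model = MultiCueEncoder()
    out = model(
        template=torch.randn(1, 3, 128, 128),
        hist_frame=torch.randn(1, 3, 128, 128),
        search=torch.randn(1, 3, 256, 256),
        trajectory=torch.randn(1, 8, 4),
    )
    print(out.shape)  # torch.Size([1, 392, 256]) = 64 + 64 + 256 + 8 tokens

Concatenating all cue tokens before a shared encoder is what lets attention relate the search region to the template, past frames, and motion history in a single pass, which is the one-stream property the abstract emphasizes.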