FSTrack: Visual tracking with feature fusion and adaptive selection
Jian Shi, Yang Yu, Bin Hui, Junze Shi, Haibo Luo
Expert Systems with Applications, vol. 298, Article 129895. Published 2025-10-01. DOI: 10.1016/j.eswa.2025.129895
Impact factor: 7.5 (JCR Q1, Computer Science, Artificial Intelligence)
https://www.sciencedirect.com/science/article/pii/S0957417425035109
Citations: 0
Abstract
Visual object tracking is a critical research area in computer vision, with significant applications in security surveillance, autonomous navigation, and other fields. During tracking, distractors and target appearance variations frequently arise, making sole reliance on the initial template unreliable. The effective integration of spatiotemporal information and search region features therefore plays a crucial role in achieving robust long-term single-object tracking. However, most existing methods indiscriminately incorporate all historical features as spatiotemporal context, potentially introducing irrelevant or redundant information that undermines tracking reliability. To address this limitation while exploiting backbone features more effectively, we propose FSTrack, which uses feature fusion to enhance search features and adaptive feature selection to strengthen spatiotemporal features. First, we integrate multi-level backbone features through feature fusion and enhance feature resolution, thereby fully exploiting the multi-scale features of the backbone network. Second, we introduce an adaptive feature selection mechanism that dynamically identifies and emphasizes discriminative historical features, enhancing the robustness of spatiotemporal modeling under diverse tracking scenarios. Third, we propose a globally contextual prediction head that overcomes the limited receptive field inherent in conventional CNN-based heads, further improving overall performance. Extensive experiments demonstrate the superiority of FSTrack. On mainstream benchmark datasets such as GOT-10k, TrackingNet, and LaSOT, our approach outperforms mainstream models that use the same or higher-resolution inputs in both speed and accuracy, achieving state-of-the-art results on tracking benchmarks.
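The abstract does not specify how the adaptive feature selection is computed. As a rough illustration of the general idea only (keeping a few relevant historical features instead of using all of them), the sketch below scores each stored feature by cosine similarity to the current feature, keeps the top k, and fuses them with softmax weights. The scoring rule, the top-k choice, and the names `cosine`, `select_history`, and `fuse` are all assumptions made for this example; this is not FSTrack's actual mechanism.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def select_history(query, history, k):
    """Pick the k history vectors most similar to query (k >= 1).

    Hypothetical selection rule for illustration, not the paper's method.
    Returns (indices, weights), where weights are a softmax over the
    similarities of the selected entries.
    """
    scores = [cosine(query, h) for h in history]
    top = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def fuse(history, indices, weights):
    """Weighted sum of the selected history vectors."""
    dim = len(history[indices[0]])
    return [sum(w * history[i][d] for i, w in zip(indices, weights))
            for d in range(dim)]

# Toy 2-D "features": two entries resemble the query, two do not.
query = [1.0, 0.0]
history = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
idx, w = select_history(query, history, k=2)
context = fuse(history, idx, w)
```

In this toy case the orthogonal and opposite vectors are discarded, so the fused context is built only from the two entries that resemble the current feature. A learned selector would replace the fixed cosine score, but the filtering step has the same shape.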
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.