Si Chen;Rui Xu;Yan Yan;Yang Hua;Da-Han Wang;Shunzhi Zhu
IEEE Transactions on Intelligent Transportation Systems, vol. 26, no. 7, pp. 9370-9386. Published 2025-06-03. DOI: 10.1109/TITS.2025.3570076. https://ieeexplore.ieee.org/document/11023144/
Journal impact factor: 7.9. JCR: Q1 (Engineering, Civil).
Hierarchical Attention-Enhanced Correlation Refinement for Robust Visual Tracking
In recent years, visual tracking has advanced markedly through the exploration of feature extraction and correlation modeling techniques. However, inadequate robustness of either the backbone network or the correlation operation continues to plague existing trackers, causing drift when they are confronted with similar distractors or cluttered backgrounds. To address this problem, we propose a hierarchical attention-enhanced correlation refinement network (HarNet) for robust visual tracking. Specifically, a gated dual-view attention (GDA) module is first designed to aggregate the intra-layer attention and the inter-layer self-attention through a fusion gate, so as to enhance the hierarchical feature representations of the template. Meanwhile, a target-aware attention (TA) module introduces the template information into the inter-layer self-attention, which highlights the target information in the search region. Moreover, a graph-guided correlation (GGC) module leverages pixel-to-local and pixel-to-global correlations to exploit both local and global spatial information between the template and the search region, and then uses a graph convolutional network (GCN) to further learn the node relationships of the correlation map, yielding more fine-grained correlations. With these three modules, HarNet enhances the feature representation and localizes the target more precisely. Extensive experiments on popular visual tracking datasets (including OTB100, VOT2016, VOT2018, VOT2019, UAV123, UAV20L, GOT-10k, and LaSOT) demonstrate the superiority of the proposed method over several state-of-the-art tracking methods.
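The abstract names three generic building blocks: a fusion gate that blends two attention views, a pixel-wise correlation between template and search features, and a graph convolution over the correlation map. The sketch below illustrates each in NumPy under loud assumptions. It is not the authors' implementation: the abstract does not give the GDA, TA, or GGC equations, so the function names (gated_fusion, pixel_to_global_correlation, gcn_layer), the gate weights w_gate, the cosine-similarity adjacency, and all shapes are hypothetical choices made only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(view_a, view_b, w_gate):
    """Blend two (C, N) feature maps with an element-wise gate in (0, 1).

    Assumption: the gate is a sigmoid of a linear map applied to the
    concatenated views. The paper's actual fusion gate may differ.
    """
    gate = sigmoid(w_gate @ np.concatenate([view_a, view_b], axis=0))
    return gate * view_a + (1.0 - gate) * view_b

def pixel_to_global_correlation(template, search):
    """Compare every template pixel with every search pixel.

    template: (C, Nt), search: (C, Ns) -> correlation map (Nt, Ns).
    """
    return template.T @ search

def gcn_layer(nodes, adj, weight):
    """One standard graph-convolution step with self-loops:
    ReLU(D^-1/2 (A + I) D^-1/2 X W), for non-negative adjacency A.
    """
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ nodes @ weight, 0.0)

# Toy sizes (hypothetical): C channels, Nt template pixels, Ns search pixels.
rng = np.random.default_rng(0)
C, Nt, Ns = 8, 16, 64

intra_view = rng.standard_normal((C, Nt))   # stand-in for intra-layer attention output
inter_view = rng.standard_normal((C, Nt))   # stand-in for inter-layer self-attention output
w_gate = 0.1 * rng.standard_normal((C, 2 * C))
template = gated_fusion(intra_view, inter_view, w_gate)

search = rng.standard_normal((C, Ns))
corr = pixel_to_global_correlation(template, search)   # (Nt, Ns)

# Treat each search position as a graph node whose feature is its
# correlation vector. Edges come from clipped cosine similarity (assumption).
nodes = corr.T                                          # (Ns, Nt)
unit = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
adj = np.maximum(unit @ unit.T, 0.0)
np.fill_diagonal(adj, 0.0)                              # self-loops are added inside gcn_layer

w_gcn = 0.1 * rng.standard_normal((Nt, Nt))
refined = gcn_layer(nodes, adj, w_gcn)                  # (Ns, Nt)
print(corr.shape, refined.shape)
```

In this toy setup the gate decides, per element, how much of each view reaches the template features, and the graph step lets each search position borrow evidence from positions with similar correlation patterns. A real tracker would learn w_gate and w_gcn by training and would operate on backbone features rather than random arrays.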
About the journal:
The journal covers the theoretical, experimental, and operational aspects of electrical and electronics engineering and information technologies as applied to Intelligent Transportation Systems (ITS). Intelligent Transportation Systems are defined as those systems utilizing synergistic technologies and systems engineering concepts to develop and improve transportation systems of all kinds. The scope of this interdisciplinary activity includes the promotion, consolidation, and coordination of ITS technical activities among IEEE entities, and the provision of a focus for cooperative activities, both internally and externally.