{"title":"HTransT++: Hierarchical Transformer with Temporal Memory and Spatial Attention for Visual Tracking","authors":"Zhixue Liang, Wenyong Dong, Bo Zhang","doi":"10.1109/ICNSC55942.2022.10004052","DOIUrl":null,"url":null,"abstract":"Transformer-based architectures have recently witnessed significant progress in visual object tracking. However, most transformer-based trackers adopt hybrid networks, which use the convolutional neural networks (CNNs) to extract the features and the transformers to fuse and enhance them. Furthermore, most of transformer-based trackers only consider spatial dependencies between the target object and the search region, but ignore temporal relations. Simultaneously considered the temporal and spatial properties inherent in video sequences, this paper presents a hierarchical transformer with temporal memory and spatial attention network for visual tracking, named HTransT ++. The proposed network employs a hierarchical transformer as the backbone to extract multi-level features. By adopting transformer-based encoder and decoder to fuse historic template features and search region image features, the spatial and temporal dependencies across video frames are captured in tracking. Extensive experiments show that our proposed method (HTransT ++) achieves outstanding performance on four visual tracking benchmarks, including VOT2018, GOT-10K, TrackingNet, and LaSOT, while running at real-time speed.","PeriodicalId":230499,"journal":{"name":"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNSC55942.2022.10004052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Transformer-based architectures have recently brought significant progress to visual object tracking. However, most transformer-based trackers adopt hybrid networks that use convolutional neural networks (CNNs) to extract features and transformers to fuse and enhance them. Furthermore, most transformer-based trackers consider only the spatial dependencies between the target object and the search region and ignore temporal relations. Simultaneously considering the temporal and spatial properties inherent in video sequences, this paper presents a hierarchical transformer with a temporal memory and spatial attention network for visual tracking, named HTransT++. The proposed network employs a hierarchical transformer as the backbone to extract multi-level features. By adopting a transformer-based encoder and decoder to fuse historical template features and search-region image features, the network captures spatial and temporal dependencies across video frames during tracking. Extensive experiments show that our proposed method (HTransT++) achieves outstanding performance on four visual tracking benchmarks, VOT2018, GOT-10K, TrackingNet, and LaSOT, while running at real-time speed.
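To make the encoder-decoder fusion described above concrete, here is a minimal PyTorch sketch of the general pattern: an encoder jointly attends over tokens from several historical templates (the temporal memory) and the current search region (spatial attention), and a decoder queries the fused memory to predict a target box. All module names, dimensions, token counts, and the single learned target query are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class TemporalSpatialFusion(nn.Module):
    """Illustrative transformer fusion of template memory and search features."""

    def __init__(self, dim=256, heads=8, enc_layers=4, dec_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
        # One learned target query; its decoded embedding drives a box head.
        # (A hypothetical design choice for this sketch.)
        self.target_query = nn.Parameter(torch.randn(1, 1, dim))
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, template_feats, search_feats):
        # template_feats: (B, T*N_t, C) tokens from T historical templates
        # search_feats:   (B, N_s, C)   tokens from the current search region
        # The encoder's self-attention spans both sets, so temporal
        # (template-to-template) and spatial (template-to-search)
        # dependencies are modeled in one pass.
        memory = self.encoder(torch.cat([template_feats, search_feats], dim=1))
        query = self.target_query.expand(search_feats.size(0), -1, -1)
        fused = self.decoder(query, memory)       # (B, 1, C)
        return self.box_head(fused.squeeze(1))    # (B, 4)

# Usage with dummy backbone outputs already projected to C=256:
fusion = TemporalSpatialFusion()
templates = torch.randn(2, 3 * 64, 256)  # 3 historical templates, 64 tokens each
search = torch.randn(2, 256, 256)        # search region, 256 tokens
boxes = fusion(templates, search)        # (2, 4) predicted boxes
```

In the paper's design the backbone is a hierarchical transformer producing multi-level features; this sketch assumes those features have already been flattened to token sequences and projected to a common channel dimension before fusion.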