{"title":"具有时间记忆和空间注意的视觉跟踪层次转换器","authors":"Zhixue Liang, Wenyong Dong, Bo Zhang","doi":"10.1109/ICNSC55942.2022.10004052","DOIUrl":null,"url":null,"abstract":"Transformer-based architectures have recently witnessed significant progress in visual object tracking. However, most transformer-based trackers adopt hybrid networks, which use the convolutional neural networks (CNNs) to extract the features and the transformers to fuse and enhance them. Furthermore, most of transformer-based trackers only consider spatial dependencies between the target object and the search region, but ignore temporal relations. Simultaneously considered the temporal and spatial properties inherent in video sequences, this paper presents a hierarchical transformer with temporal memory and spatial attention network for visual tracking, named HTransT ++. The proposed network employs a hierarchical transformer as the backbone to extract multi-level features. By adopting transformer-based encoder and decoder to fuse historic template features and search region image features, the spatial and temporal dependencies across video frames are captured in tracking. Extensive experiments show that our proposed method (HTransT ++) achieves outstanding performance on four visual tracking benchmarks, including VOT2018, GOT-10K, TrackingNet, and LaSOT, while running at real-time speed.","PeriodicalId":230499,"journal":{"name":"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HTransT++: Hierarchical Transformer with Temporal Memory and Spatial Attention for Visual Tracking\",\"authors\":\"Zhixue Liang, Wenyong Dong, Bo Zhang\",\"doi\":\"10.1109/ICNSC55942.2022.10004052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer-based architectures have recently witnessed significant progress in visual object tracking. However, most transformer-based trackers adopt hybrid networks, which use the convolutional neural networks (CNNs) to extract the features and the transformers to fuse and enhance them. Furthermore, most of transformer-based trackers only consider spatial dependencies between the target object and the search region, but ignore temporal relations. Simultaneously considered the temporal and spatial properties inherent in video sequences, this paper presents a hierarchical transformer with temporal memory and spatial attention network for visual tracking, named HTransT ++. The proposed network employs a hierarchical transformer as the backbone to extract multi-level features. By adopting transformer-based encoder and decoder to fuse historic template features and search region image features, the spatial and temporal dependencies across video frames are captured in tracking. 
Extensive experiments show that our proposed method (HTransT ++) achieves outstanding performance on four visual tracking benchmarks, including VOT2018, GOT-10K, TrackingNet, and LaSOT, while running at real-time speed.\",\"PeriodicalId\":230499,\"journal\":{\"name\":\"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICNSC55942.2022.10004052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNSC55942.2022.10004052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
HTransT++: Hierarchical Transformer with Temporal Memory and Spatial Attention for Visual Tracking
Transformer-based architectures have recently driven significant progress in visual object tracking. However, most transformer-based trackers adopt hybrid networks that use convolutional neural networks (CNNs) to extract features and transformers to fuse and enhance them. Furthermore, most transformer-based trackers consider only the spatial dependencies between the target object and the search region, ignoring temporal relations. Considering both the temporal and spatial properties inherent in video sequences, this paper presents a hierarchical transformer with a temporal memory and spatial attention network for visual tracking, named HTransT++. The proposed network employs a hierarchical transformer as the backbone to extract multi-level features. By adopting a transformer-based encoder and decoder to fuse historical template features with search-region image features, the spatial and temporal dependencies across video frames are captured during tracking. Extensive experiments show that the proposed method (HTransT++) achieves outstanding performance on four visual tracking benchmarks, including VOT2018, GOT-10K, TrackingNet, and LaSOT, while running at real-time speed.
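The abstract only outlines the fusion design, but the general idea can be sketched in a few lines of PyTorch. The snippet below is a minimal, hypothetical illustration of a temporal-memory encoder-decoder: a fixed-length queue of historical template features is concatenated with search-region tokens, fused by a transformer encoder, and queried by a decoder. All names (TemporalMemoryFusion, mem_size, the single learned query) and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalMemoryFusion(nn.Module):
    """Sketch of an encoder-decoder fusion with temporal memory,
    loosely following the abstract's description. Layer counts,
    dimensions, and the memory-update rule are assumptions."""

    def __init__(self, dim=256, heads=8, layers=2, mem_size=3):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads),
            num_layers=layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=heads),
            num_layers=layers)
        # A single learned query that reads out the target state.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.mem_size = mem_size
        self.memory = []  # token sequences of past template frames

    def update_memory(self, template_tokens):
        # Keep only the most recent mem_size template-frame features.
        self.memory.append(template_tokens.detach())
        self.memory = self.memory[-self.mem_size:]

    def forward(self, search_tokens):
        # search_tokens: (num_tokens, batch, dim), e.g. multi-level
        # features flattened from a hierarchical backbone.
        fused = self.encoder(torch.cat(self.memory + [search_tokens], dim=0))
        q = self.query.expand(-1, search_tokens.size(1), -1)
        return self.decoder(q, fused)  # (1, batch, dim) target embedding

# Illustrative usage with random tensors standing in for backbone features:
tracker = TemporalMemoryFusion()
tracker.update_memory(torch.randn(64, 1, 256))  # template frame tokens
target = tracker(torch.randn(256, 1, 256))      # search-region tokens
```

The returned embedding would then feed a localization head (e.g. a bounding-box regressor); re-invoking update_memory on selected frames is one plausible way to realize the "historical template features" the abstract refers to.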