{"title":"HTransT++: Hierarchical Transformer with Temporal Memory and Spatial Attention for Visual Tracking","authors":"Zhixue Liang, Wenyong Dong, Bo Zhang","doi":"10.1109/ICNSC55942.2022.10004052","DOIUrl":null,"url":null,"abstract":"Transformer-based architectures have recently witnessed significant progress in visual object tracking. However, most transformer-based trackers adopt hybrid networks, which use the convolutional neural networks (CNNs) to extract the features and the transformers to fuse and enhance them. Furthermore, most of transformer-based trackers only consider spatial dependencies between the target object and the search region, but ignore temporal relations. Simultaneously considered the temporal and spatial properties inherent in video sequences, this paper presents a hierarchical transformer with temporal memory and spatial attention network for visual tracking, named HTransT ++. The proposed network employs a hierarchical transformer as the backbone to extract multi-level features. By adopting transformer-based encoder and decoder to fuse historic template features and search region image features, the spatial and temporal dependencies across video frames are captured in tracking. Extensive experiments show that our proposed method (HTransT ++) achieves outstanding performance on four visual tracking benchmarks, including VOT2018, GOT-10K, TrackingNet, and LaSOT, while running at real-time speed.","PeriodicalId":230499,"journal":{"name":"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Networking, Sensing and Control (ICNSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNSC55942.2022.10004052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Transformer-based architectures have recently brought significant progress to visual object tracking. However, most transformer-based trackers adopt hybrid networks that use convolutional neural networks (CNNs) to extract features and transformers to fuse and enhance them. Furthermore, most transformer-based trackers consider only the spatial dependencies between the target object and the search region and ignore temporal relations. Simultaneously considering the temporal and spatial properties inherent in video sequences, this paper presents a hierarchical transformer with a temporal memory and spatial attention network for visual tracking, named HTransT++. The proposed network employs a hierarchical transformer as the backbone to extract multi-level features. By adopting a transformer-based encoder and decoder to fuse historical template features and search-region image features, the network captures spatial and temporal dependencies across video frames during tracking. Extensive experiments show that our proposed method (HTransT++) achieves outstanding performance on four visual tracking benchmarks, VOT2018, GOT-10K, TrackingNet, and LaSOT, while running at real-time speed.
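To make the encoder-decoder fusion described above concrete, here is a minimal PyTorch sketch of the general pattern: an encoder jointly attends over tokens from several historical templates (the temporal memory) and the current search region (spatial attention), and a decoder queries the fused memory to predict a target box. All module names, dimensions, token counts, and the single learned target query are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class TemporalSpatialFusion(nn.Module):
    """Illustrative transformer fusion of template memory and search features."""

    def __init__(self, dim=256, heads=8, enc_layers=4, dec_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
        # One learned target query; its decoded embedding drives a box head.
        # (A hypothetical design choice for this sketch.)
        self.target_query = nn.Parameter(torch.randn(1, 1, dim))
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, template_feats, search_feats):
        # template_feats: (B, T*N_t, C) tokens from T historical templates
        # search_feats:   (B, N_s, C)   tokens from the current search region
        # The encoder's self-attention spans both sets, so temporal
        # (template-to-template) and spatial (template-to-search)
        # dependencies are modeled in one pass.
        memory = self.encoder(torch.cat([template_feats, search_feats], dim=1))
        query = self.target_query.expand(search_feats.size(0), -1, -1)
        fused = self.decoder(query, memory)       # (B, 1, C)
        return self.box_head(fused.squeeze(1))    # (B, 4)

# Usage with dummy backbone outputs already projected to C=256:
fusion = TemporalSpatialFusion()
templates = torch.randn(2, 3 * 64, 256)  # 3 historical templates, 64 tokens each
search = torch.randn(2, 256, 256)        # search region, 256 tokens
boxes = fusion(templates, search)        # (2, 4) predicted boxes
```

In the paper's design the backbone is a hierarchical transformer producing multi-level features; this sketch assumes those features have already been flattened to token sequences and projected to a common channel dimension before fusion.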