Feng Cheng, Gaoliang Peng, Junbao Li, Benqi Zhao, Jeng-Shyang Pan, Hang Li
DOI: 10.1016/j.neucom.2024.128821
Journal: Neurocomputing, Volume 614, Article 128821 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 5.5)
Published: 2024-11-07 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0925231224015923
A Transformer-based network with adaptive spatial prior for visual tracking
Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, the transformer has demonstrated its efficacy in visual object tracking tasks, owing to its capacity to capture long-range dependencies between image pixels. However, two limitations hinder the performance of transformer-based trackers. First, the transformer partitions the image into a sequence of patches, which disrupts the internal structural information of the object. Second, transformer-based trackers encode the target template and the search region together, potentially leading to confusion between the target and the background during feature interaction. To address these issues, we propose a fully transformer-based tracking framework that learns structural prior information, called SPformer. Specifically, a self-attention spatial-prior generative network is established to model the spatial associations between features. Moreover, cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to extract the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-the-art (SOTA) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.
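The idea of a Gaussian spatial prior in cross-attention can be pictured as an additive log-space bias on the content scores, so that keys near the expected target location receive more attention mass. The sketch below is an illustrative reconstruction of that general mechanism under stated assumptions, not the authors' SPformer implementation; the function name `gaussian_prior_attention`, its parameters, and the toy 2x2 grid are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_prior_attention(q, k, v, centers, sigma, key_coords):
    """Cross-attention whose content scores are biased by a Gaussian
    spatial prior centred on the expected target location (illustrative,
    not the paper's actual module)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                        # (Nq, Nk) content similarity
    # Squared distance from each query's prior centre to each key position.
    dist2 = ((key_coords[None, :, :] - centers[:, None, :]) ** 2).sum(-1)
    prior = -dist2 / (2.0 * sigma ** 2)                  # log-space Gaussian bias
    weights = softmax(scores + prior, axis=-1)           # prior reweights attention
    return weights @ v, weights

# Toy usage: 4 key positions on a 2x2 grid, one query whose prior centre
# sits on the first key; content scores are uniform (zeros), so the
# Gaussian prior alone decides where the attention mass goes.
key_coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
q = np.zeros((1, 2))
k = np.zeros((4, 2))
v = np.eye(4)
centers = np.array([[0.0, 0.0]])
out, weights = gaussian_prior_attention(q, k, v, centers, sigma=0.5,
                                        key_coords=key_coords)
```

With uniform content scores, the attention distribution equals the softmax of the Gaussian bias alone, so the key closest to the prior centre dominates; with informative content scores, the prior acts as a soft spatial regularizer rather than a hard mask.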
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics are neurocomputing theory, practice, and applications.