A Transformer-based network with adaptive spatial prior for visual tracking

IF 5.5 · CAS Zone 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Neurocomputing · Pub Date: 2024-11-07 · DOI: 10.1016/j.neucom.2024.128821

Feng Cheng, Gaoliang Peng, Junbao Li, Benqi Zhao, Jeng-Shyang Pan, Hang Li

Volume 614, Article 128821 · https://www.sciencedirect.com/science/article/pii/S0925231224015923

Citations: 0

Abstract



Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformers have proven effective in visual object tracking because they can capture long-range dependencies between image pixels. However, two limitations hinder further performance gains for transformer-based trackers. First, a transformer splits the image into a sequence of patches, which disrupts the internal structural information of the object. Second, transformer-based trackers encode the target template and the search region together, which can cause confusion between target and background during feature interaction. To address these issues, we propose SPformer, a fully transformer-based tracking framework that learns structural prior information. Specifically, a self-attention spatial-prior generative network is established to model the spatial associations between features. Moreover, cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to extract semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-the-art (SOTA) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.
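The abstract does not give the formulation of the Gaussian structural prior, so the following is only an illustrative sketch of one common way to inject a spatial prior into cross-attention: adding the log of a Gaussian over search-region patch positions to the attention logits. The function names, the choice of template tokens as queries, the fixed center and sigma, and the additive-bias form are all assumptions made for illustration, not SPformer's actual design.

```python
import numpy as np

def gaussian_log_prior(grid_h, grid_w, center, sigma):
    """Unnormalized log of an isotropic Gaussian over a patch grid, flattened.

    Hypothetical helper: center and sigma are fixed here, whereas a learned
    prior would predict them from features.
    """
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return (-d2 / (2.0 * sigma ** 2)).reshape(-1)  # shape: (grid_h * grid_w,)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_with_prior(template, search, log_prior):
    """Single-head cross-attention with an additive spatial bias.

    template:  (Nt, d) query tokens (assumed role)
    search:    (Ns, d) key/value tokens (assumed role)
    log_prior: (Ns,)   bias added to every query's logits
    """
    d = template.shape[-1]
    logits = template @ search.T / np.sqrt(d)            # (Nt, Ns)
    weights = softmax(logits + log_prior[None, :], axis=-1)
    return weights @ search, weights

# Toy usage: 4 template tokens, a 6x6 grid of search tokens, 8-dim features.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 8))
search = rng.standard_normal((36, 8))
log_prior = gaussian_log_prior(6, 6, center=(2, 3), sigma=1.5)
out, weights = cross_attention_with_prior(template, search, log_prior)
```

In this sketch the bias pulls attention toward search patches near the assumed target center, which is one way a spatial prior could reduce target/background confusion; with uninformative features the attention map reduces to the normalized Gaussian itself.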


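Source journal

Neurocomputing (Engineering and Technology; Computer Science: Artificial Intelligence)

CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days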

About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics are neurocomputing theory, practice, and applications.