基于变压器的网络与自适应空间先验,用于视觉跟踪

IF 5.5 2区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Feng Cheng , Gaoliang Peng , Junbao Li , Benqi Zhao , Jeng-Shyang Pan , Hang Li
{"title":"基于变压器的网络与自适应空间先验,用于视觉跟踪","authors":"Feng Cheng ,&nbsp;Gaoliang Peng ,&nbsp;Junbao Li ,&nbsp;Benqi Zhao ,&nbsp;Jeng-Shyang Pan ,&nbsp;Hang Li","doi":"10.1016/j.neucom.2024.128821","DOIUrl":null,"url":null,"abstract":"<div><div>Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformer has shown its demonstrated efficacy in visual object tracking tasks, due to its capacity to capture the long-range dependencies between image pixels. However, two limitations hinder the performance improvement of transformer-based trackers. Firstly, transformer splits and partitions the image into a sequence of patches, which disrupts the internal structural information of the object. Secondly, transformer-based trackers encode the target template and search region together, potentially leading to confusion between the target and background during feature interaction. To address the above issues, we propose a fully transformer-based tracking framework via learning structural prior information, called SPformer. In other words, a self-attention spatial-prior generative network is established for simulating the spatial associations between features. Moreover, the cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to seek the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-art (SOAT) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128821"},"PeriodicalIF":5.5000,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Transformer-based network with adaptive spatial prior for visual tracking\",\"authors\":\"Feng Cheng ,&nbsp;Gaoliang Peng ,&nbsp;Junbao Li ,&nbsp;Benqi Zhao ,&nbsp;Jeng-Shyang Pan ,&nbsp;Hang Li\",\"doi\":\"10.1016/j.neucom.2024.128821\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformer has shown its demonstrated efficacy in visual object tracking tasks, due to its capacity to capture the long-range dependencies between image pixels. However, two limitations hinder the performance improvement of transformer-based trackers. Firstly, transformer splits and partitions the image into a sequence of patches, which disrupts the internal structural information of the object. Secondly, transformer-based trackers encode the target template and search region together, potentially leading to confusion between the target and background during feature interaction. To address the above issues, we propose a fully transformer-based tracking framework via learning structural prior information, called SPformer. In other words, a self-attention spatial-prior generative network is established for simulating the spatial associations between features. Moreover, the cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to seek the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-art (SOAT) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"614 \",\"pages\":\"Article 128821\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224015923\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224015923","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

复杂场景中的单个物体跟踪(SOT)是计算机视觉领域的重大挑战。近年来,变换器由于能够捕捉图像像素之间的长距离依赖关系,在视觉物体跟踪任务中显示出了明显的功效。然而,两个局限性阻碍了基于变换器的跟踪器性能的提高。首先,变换器会将图像分割成一系列斑块,从而破坏了物体的内部结构信息。其次,基于变换器的跟踪器将目标模板和搜索区域编码在一起,在特征交互过程中可能导致目标和背景的混淆。为了解决上述问题,我们提出了一种通过学习结构先验信息的完全基于变换器的跟踪框架,称为 SPformer。换句话说,我们建立了一个自注意力空间先验生成网络,用于模拟特征之间的空间关联。此外,还开发了基于高斯分布和任意分布的交叉注意结构先验提取器,以寻求物体模板和搜索区域之间的语义交互特征,从而有效缓解特征混淆。在八个主流基准上进行的广泛实验表明,SPformer 的性能优于现有的先进(SOAT)跟踪器。我们进一步分析了所提出的两个先验模块的有效性,并验证了它们在目标跟踪模型中的应用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
A Transformer-based network with adaptive spatial prior for visual tracking
Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformer has shown its demonstrated efficacy in visual object tracking tasks, due to its capacity to capture the long-range dependencies between image pixels. However, two limitations hinder the performance improvement of transformer-based trackers. Firstly, transformer splits and partitions the image into a sequence of patches, which disrupts the internal structural information of the object. Secondly, transformer-based trackers encode the target template and search region together, potentially leading to confusion between the target and background during feature interaction. To address the above issues, we propose a fully transformer-based tracking framework via learning structural prior information, called SPformer. In other words, a self-attention spatial-prior generative network is established for simulating the spatial associations between features. Moreover, the cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to seek the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-art (SOAT) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Neurocomputing
Neurocomputing 工程技术-计算机:人工智能
CiteScore
13.10
自引率
10.00%
发文量
1382
审稿时长
70 days
期刊介绍: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信