A Two-Stage Foveal Vision Tracker Based on Transformer Model

IF 5 3区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Cognitive and Developmental Systems Pub Date : 2024-03-18 DOI:10.1109/TCDS.2024.3377642

Guang Han;Jianshu Ma;Ziyang Li;Haitao Zhao

{"title":"A Two-Stage Foveal Vision Tracker Based on Transformer Model","authors":"Guang Han;Jianshu Ma;Ziyang Li;Haitao Zhao","doi":"10.1109/TCDS.2024.3377642","DOIUrl":null,"url":null,"abstract":"With the development of transformer visual models, attention-based trackers have shown highly competitive performance in the field of object tracking. However, in some tracking scenarios, especially those with multiple similar objects, the performance of existing trackers is often not satisfactory. In order to improve the performance of trackers in such scenarios, inspired by the fovea vision structure and its visual characteristics, this article proposes a novel foveal vision tracker (FVT). FVT combines the process of human eye fixation and object tracking, pruning based on the distance to the object rather than attention scores. This pruning method allows the receptive field of the feature extraction network to focus on the object, excluding background interference. FVT divides the feature extraction network into two stages: local and global, and introduces the local recursive module (LRM) and the view elimination module (VEM). LRM is used to enhance foreground features in the local stage, while VEM generates circular fovea-like visual field masks in the global stage and prunes tokens outside the mask, guiding the model to focus attention on high-information regions of the object. Experimental results on multiple object tracking datasets demonstrate that the proposed FVT achieves stronger object discrimination capability in the feature extraction stage, improves tracking accuracy and robustness in complex scenes, and achieves a significant accuracy improvement with an area overlap (AO) of 72.6% on the generic object tracking (GOT)-10k dataset.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 4","pages":"1575-1588"},"PeriodicalIF":5.0000,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cognitive and Developmental Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10472720/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

With the development of transformer visual models, attention-based trackers have shown highly competitive performance in the field of object tracking. However, in some tracking scenarios, especially those with multiple similar objects, the performance of existing trackers is often not satisfactory. In order to improve the performance of trackers in such scenarios, inspired by the fovea vision structure and its visual characteristics, this article proposes a novel foveal vision tracker (FVT). FVT combines the process of human eye fixation and object tracking, pruning based on the distance to the object rather than attention scores. This pruning method allows the receptive field of the feature extraction network to focus on the object, excluding background interference. FVT divides the feature extraction network into two stages: local and global, and introduces the local recursive module (LRM) and the view elimination module (VEM). LRM is used to enhance foreground features in the local stage, while VEM generates circular fovea-like visual field masks in the global stage and prunes tokens outside the mask, guiding the model to focus attention on high-information regions of the object. Experimental results on multiple object tracking datasets demonstrate that the proposed FVT achieves stronger object discrimination capability in the feature extraction stage, improves tracking accuracy and robustness in complex scenes, and achieves a significant accuracy improvement with an area overlap (AO) of 72.6% on the generic object tracking (GOT)-10k dataset.

查看原文本刊更多论文

基于变压器模型的两级眼窝视觉跟踪器

随着变换器视觉模型的发展，基于注意力的跟踪器在物体跟踪领域显示出了极具竞争力的性能。然而，在某些跟踪场景中，特别是在有多个相似物体的场景中，现有跟踪器的性能往往不能令人满意。为了提高跟踪器在这类场景中的性能，本文受眼窝视觉结构及其视觉特性的启发，提出了一种新型眼窝视觉跟踪器（FVT）。FVT 结合了人眼固定和物体跟踪的过程，根据到物体的距离而不是注意力分数进行剪枝。这种剪枝方法能让特征提取网络的感受野聚焦于目标，排除背景干扰。FVT 将特征提取网络分为两个阶段：局部和全局，并引入了局部递归模块（LRM）和视图消除模块（VEM）。局部递归模块用于增强局部阶段的前景特征，而全局递归模块则在全局阶段生成类似眼窝的圆形视场遮罩，并剪除遮罩外的标记，引导模型将注意力集中在物体的高信息区域。在多个物体跟踪数据集上的实验结果表明，所提出的 FVT 在特征提取阶段实现了更强的物体识别能力，提高了复杂场景中的跟踪精度和鲁棒性，并在通用物体跟踪（GOT）-10k 数据集上实现了显著的精度提高，区域重叠率（AO）达到 72.6%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Cognitive and Developmental Systems Computer Science-Software

CiteScore

7.20

自引率

10.00%

发文量

170

期刊介绍： The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.