Improving Visual Object Tracking Through Visual Prompting

IF 8.4 · Zone 1, Computer Science · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia · Pub Date: 2025-01-27 · DOI: 10.1109/TMM.2025.3535323

Shih-Fang Chen;Jun-Cheng Chen;I-Hong Jhuo;Yen-Yu Lin

IEEE Transactions on Multimedia, vol. 27, pp. 2682-2694. Journal Article. https://ieeexplore.ieee.org/document/10855520/

Citations: 0

Abstract

Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Adapting the target representation dynamically against distractors is challenging because prevailing trackers have limited discriminative capability. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT introduces a prompt generation network that uses the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge to tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once refined, the visual prompt better highlights potential target locations and carries less irrelevant information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps under the guidance of the visual prompt, thus effectively suppressing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pre-trained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method, can suppress distracting objects and enhance the tracker.
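The refinement step described in the abstract (re-weighting candidate locations by their similarity to a reference template) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `cosine_similarity` and `refine_prompt`, the hand-written 3-dimensional vectors standing in for CLIP embeddings, and the rule of multiplying each candidate score by a clipped cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(candidates, template):
    """Cosine similarity between each candidate row and one template vector."""
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    template = template / np.linalg.norm(template)
    return candidates @ template


def refine_prompt(score_map, locations, candidate_embeddings, template_embedding):
    """Hypothetical refinement: keep candidate peaks, scaled by template similarity."""
    sims = cosine_similarity(candidate_embeddings, template_embedding)
    weights = np.clip(sims, 0.0, 1.0)  # negative similarity counts as no support
    prompt = np.zeros_like(score_map, dtype=float)
    for (row, col), weight in zip(locations, weights):
        prompt[row, col] = score_map[row, col] * weight
    return prompt


# Toy tracker output: two candidate peaks on a 4x4 score map.
score_map = np.zeros((4, 4))
score_map[1, 1] = 0.9  # true target
score_map[2, 3] = 0.8  # distractor with a similar tracker score
locations = [(1, 1), (2, 3)]

# Stand-in embeddings (assumed; a real system would obtain these from CLIP).
template = np.array([1.0, 0.0, 0.0])
embeddings = np.array([
    [0.9, 0.1, 0.0],  # close to the template
    [0.0, 1.0, 0.0],  # orthogonal to the template
])

prompt = refine_prompt(score_map, locations, embeddings, template)
print(prompt)
```

In this toy setting the distractor's entry in the prompt drops to zero while the target's entry stays close to its original score, which is the qualitative effect the abstract attributes to the refined prompt.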


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.