Vision-Language Tracking With CLIP and Interactive Prompt Learning
Hong Zhu; Qingyang Lu; Lei Xue; Pingping Zhang; Guanglin Yuan
IEEE Transactions on Intelligent Transportation Systems, vol. 26, no. 3, pp. 3659-3670
DOI: 10.1109/TITS.2024.3520103
Published: 2024-12-27
https://ieeexplore.ieee.org/document/10817474/
Citations: 0
Abstract
Vision-language tracking is an emerging topic in intelligent transportation systems, particularly significant in autonomous driving and road surveillance. The task aims to combine visual and auxiliary linguistic modalities to jointly locate the target object in a video sequence. Currently, multi-modal data scarcity and burdensome modality fusion are the two major factors limiting the development of vision-language tracking. To tackle these issues, we propose an efficient and effective one-stage vision-language tracking framework (CPIPTrack) that unifies feature extraction and multi-modal fusion through interactive prompt learning. Feature extraction is performed by the high-performance vision-language foundation model CLIP, so the tracker inherits the impressive generalization ability of the large-scale model. Modality fusion is reduced to a few lightweight prompts, yielding significant savings in computational resources. Specifically, we design three types of prompts to dynamically learn the layer-wise feature relationships between vision and language, facilitating rich context interactions while adapting the pre-trained CLIP. In this manner, discriminative target-oriented visual features can be extracted under language and template guidance and used for subsequent reasoning. Because heavy external modality-fusion modules are eliminated, the proposed CPIPTrack shows high efficiency in both training and inference. CPIPTrack has been extensively evaluated on three benchmark datasets, and the experimental results demonstrate that it achieves a good performance-speed balance, with an AUC of 66.0% on LaSOT and a runtime of 51.7 FPS on an RTX 2080 Super.
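To make the prompt-learning idea concrete, the sketch below shows one way layer-wise prompts could be injected into a frozen CLIP-style vision transformer: learnable visual prompts plus projections of the language and template embeddings are prepended to the patch tokens at every layer, so only the prompt parameters are trained. This is a minimal illustration under assumed names and shapes (PromptedBlock, vis_prompts, lang_proj, tmpl_proj are all hypothetical); it does not reproduce the authors' actual CPIPTrack architecture.

```python
# Hypothetical sketch of layer-wise interactive prompt learning on a frozen
# CLIP-style encoder. All names/shapes are illustrative assumptions, not the
# CPIPTrack implementation described in the paper.
import torch
import torch.nn as nn

class PromptedBlock(nn.Module):
    """One frozen transformer block with learnable prompts prepended at its
    input (prompt tuning rather than weight fine-tuning)."""

    def __init__(self, block: nn.Module, dim: int, n_vis_prompts: int = 4):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # backbone stays frozen; only prompts train
        # Three illustrative prompt types: learnable visual prompts plus
        # projections of the language and template embeddings.
        self.vis_prompts = nn.Parameter(0.02 * torch.randn(n_vis_prompts, dim))
        self.lang_proj = nn.Linear(dim, dim)
        self.tmpl_proj = nn.Linear(dim, dim)

    def forward(self, tokens, text_feat, tmpl_feat):
        # tokens: (B, N, dim) search-region patch tokens
        # text_feat, tmpl_feat: (B, dim) language / template embeddings
        bsz = tokens.size(0)
        lang_p = self.lang_proj(text_feat).unsqueeze(1)            # (B, 1, dim)
        tmpl_p = self.tmpl_proj(tmpl_feat).unsqueeze(1)            # (B, 1, dim)
        vis_p = self.vis_prompts.unsqueeze(0).expand(bsz, -1, -1)  # (B, P, dim)
        x = torch.cat([lang_p, tmpl_p, vis_p, tokens], dim=1)
        x = self.block(x)
        # Drop the prompt slots so the token count is stable across layers.
        return x[:, 2 + vis_p.size(1):, :]

# Toy usage: a 12-layer stack of frozen blocks, each with its own prompts.
dim = 512
blocks = nn.ModuleList(
    PromptedBlock(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), dim)
    for _ in range(12)
)
tokens = torch.randn(2, 196, dim)  # 14x14 patch grid, batch of 2
text_feat = torch.randn(2, dim)    # stand-in for a CLIP text embedding
tmpl_feat = torch.randn(2, dim)    # stand-in for a pooled template feature
for blk in blocks:
    tokens = blk(tokens, text_feat, tmpl_feat)
print(tokens.shape)                # torch.Size([2, 196, 512])
```

Because the backbone parameters never receive gradients, a scheme like this trains only the per-layer prompt tensors and two small linear projections, which is consistent with the paper's claim of lightweight fusion and low training cost.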
Journal Description:
The journal covers the theoretical, experimental, and operational aspects of electrical and electronics engineering and information technologies as applied to Intelligent Transportation Systems (ITS). Intelligent transportation systems are defined as those systems utilizing synergistic technologies and systems-engineering concepts to develop and improve transportation systems of all kinds. The scope of this interdisciplinary activity includes the promotion, consolidation, and coordination of ITS technical activities among IEEE entities, and provides a focus for cooperative activities, both internally and externally.