Language-Guided Dual-Modal Local Correspondence for Single Object Tracking

IF 8.4 | Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2024-08-29 DOI:10.1109/TMM.2024.3410141

Jun Yu;Zhongpeng Cai;Yihao Li;Lei Wang;Fang Gao;Ye Yu

Volume 26, pages 10637-10650. Journal Article. https://ieeexplore.ieee.org/document/10659157/

Citations: 0

Abstract

This paper focuses on advancing single-object tracking in computer vision, which has broad applications including robotic vision, video surveillance, and sports video analysis. Current methods that rely solely on the target's initial visual information encounter performance bottlenecks and limited applicability, because appearance features carry little target semantics and the target's appearance changes continuously. To address these issues, we propose a visual-language dual-modal approach to single-object tracking that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single-object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. We also propose a new global relocalization method that uses visual-language bimodal information to perceive target disappearance and misalignment and adaptively relocate the target within the entire image. This improves the tracker's ability to adapt to changes in target appearance over long periods, enabling long-term single-object tracking based on bimodal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, which demonstrates the effectiveness and efficiency of our approach.
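The local pairing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how local features are extracted or matched, so the cosine-similarity matching, the function names (`pair_local_features`, `needs_relocalization`), and the `threshold` value below are all assumptions chosen for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pair_local_features(visual_parts, language_parts):
    """Assumed matching rule: pair each local visual feature with the
    most similar local language feature.
    Returns a list of (language_index, similarity), one per visual part."""
    pairs = []
    for v in visual_parts:
        sims = [cosine(v, l) for l in language_parts]
        best = max(range(len(sims)), key=sims.__getitem__)
        pairs.append((best, sims[best]))
    return pairs

def needs_relocalization(pairs, threshold=0.5):
    """Hypothetical trigger: flag the target as lost or misaligned when the
    mean pairing similarity falls below an arbitrary threshold."""
    mean_sim = sum(s for _, s in pairs) / len(pairs)
    return mean_sim < threshold

# Toy 2-D features standing in for real embeddings.
visual_parts = [[1.0, 0.0], [0.0, 1.0]]
language_parts = [[0.0, 2.0], [3.0, 0.0]]
pairs = pair_local_features(visual_parts, language_parts)
lost = needs_relocalization(pairs)
```

In this toy example each visual part aligns perfectly with one language feature, so the hypothetical trigger does not fire. A real tracker would presumably use learned embeddings and a search over the whole frame once the trigger fires, which the sketch does not attempt.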


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors.