Jun Yu;Zhongpeng Cai;Yihao Li;Lei Wang;Fang Gao;Ye Yu
{"title":"用于单个物体跟踪的语言引导双模局部对应技术","authors":"Jun Yu;Zhongpeng Cai;Yihao Li;Lei Wang;Fang Gao;Ye Yu","doi":"10.1109/TMM.2024.3410141","DOIUrl":null,"url":null,"abstract":"This paper focuses on the advancement of single-object tracking technologies in computer vision, which have broad applications including robotic vision, video surveillance, and sports video analysis. Current methods relying solely on the target's initial visual information encounter performance bottlenecks and limited applications, due to the scarcity of target semantics in appearance features and the continuous change in the target's appearance. To address these issues, we propose a novel approach, combining visual-language dual-modal single-object tracking, that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single-object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. In addition, we also propose a new global relocalization method that utilizes visual language bimodal information to perceive target disappearance and misalignment and adaptively reposition the target in the entire image. This improves the tracker's ability to adapt to changes in target appearance over long periods of time, enabling long-term single target tracking based on bimodal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, which demonstrates the effectiveness and efficiency of our approach.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10637-10650"},"PeriodicalIF":8.4000,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Language-Guided Dual-Modal Local Correspondence for Single Object Tracking\",\"authors\":\"Jun Yu;Zhongpeng Cai;Yihao Li;Lei Wang;Fang Gao;Ye Yu\",\"doi\":\"10.1109/TMM.2024.3410141\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper focuses on the advancement of single-object tracking technologies in computer vision, which have broad applications including robotic vision, video surveillance, and sports video analysis. Current methods relying solely on the target's initial visual information encounter performance bottlenecks and limited applications, due to the scarcity of target semantics in appearance features and the continuous change in the target's appearance. To address these issues, we propose a novel approach, combining visual-language dual-modal single-object tracking, that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single-object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. In addition, we also propose a new global relocalization method that utilizes visual language bimodal information to perceive target disappearance and misalignment and adaptively reposition the target in the entire image. 
This improves the tracker's ability to adapt to changes in target appearance over long periods of time, enabling long-term single target tracking based on bimodal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, which demonstrates the effectiveness and efficiency of our approach.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"26 \",\"pages\":\"10637-10650\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2024-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10659157/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10659157/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Language-Guided Dual-Modal Local Correspondence for Single Object Tracking
This paper focuses on the advancement of single-object tracking technologies in computer vision, which have broad applications including robotic vision, video surveillance, and sports video analysis. Current methods that rely solely on the target's initial visual information encounter performance bottlenecks and limited applicability, due to the scarcity of target semantics in appearance features and the continuous change in the target's appearance. To address these issues, we propose a novel visual-language dual-modal single-object tracking approach that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single-object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. In addition, we propose a new global relocalization method that uses visual-language dual-modal information to detect target disappearance and misalignment and adaptively reposition the target within the entire image. This improves the tracker's ability to adapt to changes in target appearance over long periods of time, enabling long-term single-object tracking based on dual-modal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, demonstrating the effectiveness and efficiency of our approach.
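To make the local correspondence idea concrete, the following is a minimal, hedged sketch of how local visual features (e.g., patch tokens) might be paired with local language features (e.g., word tokens) via similarity-based soft assignment. The function name, tensor shapes, and the softmax matching rule are illustrative assumptions and do not reflect the authors' actual architecture described in the paper.

```python
# Illustrative sketch only: pairing local visual tokens with local language tokens.
# The extractors, dimensions, and matching rule are assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def local_correspondence(visual_feats: torch.Tensor,
                         language_feats: torch.Tensor) -> torch.Tensor:
    """Softly pair local visual features with local language features.

    visual_feats:   (N_v, D) local visual semantic features (e.g., patch tokens)
    language_feats: (N_l, D) local language features (e.g., word tokens)
    Returns an (N_v, N_l) matrix where each row softly assigns one visual
    token to the language tokens, based on cosine similarity.
    """
    v = F.normalize(visual_feats, dim=-1)          # unit-normalize visual tokens
    l = F.normalize(language_feats, dim=-1)         # unit-normalize language tokens
    sim = v @ l.t()                                 # pairwise cosine similarities
    assignment = sim.softmax(dim=-1)                # soft pairing over language axis
    return assignment

if __name__ == "__main__":
    # Toy usage: 49 patch tokens (7x7 grid) paired with 8 word tokens, dim 256.
    vis = torch.randn(49, 256)
    lang = torch.randn(8, 256)
    pairing = local_correspondence(vis, lang)
    print(pairing.shape)  # torch.Size([49, 8])
```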
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.