Cognitive Disentanglement for Referring Multi-Object Tracking

IF 15.5; Zone 1, Computer Science; Q1, Computer Science, Artificial Intelligence

Information Fusion. Pub Date: 2025-05-31. DOI: 10.1016/j.inffus.2025.103349

Shaofeng Liang, Runwei Guan, Wangwang Lian, Daizong Liu, Xiaolou Sun, Dongming Wu, Yutao Yue, Weiping Ding, Hui Xiong

Information Fusion, Volume 124, Article 103349. https://www.sciencedirect.com/science/article/pii/S1566253525004221

Citations: 0

Abstract

As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes that require a comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the "what" and "where" pathways of the human visual processing system to RMOT tasks. Specifically, our framework first establishes cross-modal connections while preserving modality-specific characteristics. It then disentangles language descriptions and hierarchically injects them into object queries, refining object understanding from coarse to fine-grained semantic levels. Finally, we reconstruct language representations based on visual features, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state of the art in RMOT while providing new insights into multi-source information fusion.
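The "disentangle, then inject hierarchically" idea in the abstract can be illustrated with a toy sketch. This is not the CDRMT implementation: the keyword sets `WHERE_CUES` and `STOP_WORDS`, the function names, and the injection order (sentence level first, then "what", then "where") are all assumptions made for illustration, and a rule-based word split stands in for whatever learned disentanglement the paper uses.

```python
import math

# Hypothetical cue lists: a rule-based stand-in for a learned disentanglement.
WHERE_CUES = {"left", "right", "front", "behind", "moving", "turning", "parked", "ahead"}
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "are", "is", "which", "that"}


def disentangle(expression):
    """Split an expression into 'what' (static) and 'where' (spatial/motion) words."""
    words = [w for w in expression.lower().split() if w not in STOP_WORDS]
    what = [w for w in words if w not in WHERE_CUES]
    where = [w for w in words if w in WHERE_CUES]
    return what, where


def attend(query, keys):
    """Residual dot-product attention: add a softmax-weighted mix of keys to the query."""
    if not keys:
        return list(query)  # nothing to inject, query is unchanged
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    mixed = [
        sum(w * key[i] for w, key in zip(weights, keys)) / total
        for i in range(len(query))
    ]
    return [q + m for q, m in zip(query, mixed)]


def inject(query, sentence_vec, what_vecs, where_vecs):
    """Coarse to fine (assumed order): sentence context, then 'what', then 'where'."""
    query = attend(query, [sentence_vec])
    query = attend(query, what_vecs)
    return attend(query, where_vecs)


# Example: split a referring expression into its two streams.
what, where = disentangle("the red cars moving left")
```

Here an object query is a plain list of floats, and each language stream is a list of such vectors. In a real tracker these would be learned embeddings and the attention would be multi-head and trained end to end.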




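Source journal: Information Fusion (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 33.20

Self-citation rate: 4.30%

Articles published: 161

Review time: 7.9 months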

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.