Open-Vocabulary Multi-Object Tracking With Domain Generalized and Temporally Adaptive Features

IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Run Li;Dawei Zhang;Yanchao Wang;Yunliang Jiang;Zhonglong Zheng;Sang-Woon Jeon;Hua Wang
{"title":"Open-Vocabulary Multi-Object Tracking With Domain Generalized and Temporally Adaptive Features","authors":"Run Li;Dawei Zhang;Yanchao Wang;Yunliang Jiang;Zhonglong Zheng;Sang-Woon Jeon;Hua Wang","doi":"10.1109/TMM.2025.3557619","DOIUrl":null,"url":null,"abstract":"Open-vocabulary multi-object tracking (OVMOT) is a cutting research direction within the multi-object tracking field. It employs large multi-modal models to effectively address the challenge of tracking unseen objects within dynamic visual scenes. While models require robust domain generalization and temporal adaptability, OVTrack, the only existing open-vocabulary multi-object tracker, relies solely on static appearance information and lacks these crucial adaptive capabilities. In this paper, we propose OVSORT, a new framework designed to improve domain generalization and temporal information processing. Specifically, we first propose the Adaptive Contextual Normalization (ACN) technique in OVSORT, which dynamically adjusts the feature maps based on the dataset's statistical properties, thereby fine-tuning our model's to improve domain generalization. Then, we introduce motion cues for the first time. Using our Joint Motion and Appearance Tracking (JMAT) strategy, we obtain a joint similarity measure and subsequently apply the Hungarian algorithm for data association. Finally, our Hierarchical Adaptive Feature Update (HAFU) strategy adaptively adjusts feature updates according to the current state of each trajectory, which greatly improves the utilization of temporal information. Extensive experiments on the TAO validation set and test set confirm the superiority of OVSORT, which significantly improves the handling of novel and base classes. It surpasses existing methods in terms of accuracy and generalization, setting a new state-of-the-art for OVMOT.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3009-3022"},"PeriodicalIF":8.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948331/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Open-vocabulary multi-object tracking (OVMOT) is a cutting-edge research direction within the multi-object tracking field. It employs large multi-modal models to effectively address the challenge of tracking unseen objects within dynamic visual scenes. While models require robust domain generalization and temporal adaptability, OVTrack, the only existing open-vocabulary multi-object tracker, relies solely on static appearance information and lacks these crucial adaptive capabilities. In this paper, we propose OVSORT, a new framework designed to improve domain generalization and temporal information processing. Specifically, we first propose the Adaptive Contextual Normalization (ACN) technique in OVSORT, which dynamically adjusts the feature maps based on the dataset's statistical properties, thereby fine-tuning our model to improve domain generalization. Then, we introduce motion cues into the open-vocabulary tracking pipeline for the first time. Using our Joint Motion and Appearance Tracking (JMAT) strategy, we obtain a joint similarity measure and subsequently apply the Hungarian algorithm for data association. Finally, our Hierarchical Adaptive Feature Update (HAFU) strategy adaptively adjusts feature updates according to the current state of each trajectory, which greatly improves the utilization of temporal information. Extensive experiments on the TAO validation set and test set confirm the superiority of OVSORT, which significantly improves the handling of novel and base classes. It surpasses existing methods in terms of accuracy and generalization, setting a new state-of-the-art for OVMOT.
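
The abstract describes JMAT and HAFU only at a high level. The sketch below illustrates how such a joint motion-appearance association step and an adaptive feature update are typically wired together: an IoU-based motion term and a cosine appearance term are blended into a joint similarity, the assignment is solved with the Hungarian algorithm, and matched track features are refreshed with a confidence-weighted exponential moving average. The weight lambda_app, the cost threshold, and the confidence-based momentum rule are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a JMAT/HAFU-style association step (assumptions noted above).
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(track_boxes: np.ndarray, det_boxes: np.ndarray) -> np.ndarray:
    """Pairwise IoU between predicted track boxes and detection boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(track_boxes[:, None, 0], det_boxes[None, :, 0])
    y1 = np.maximum(track_boxes[:, None, 1], det_boxes[None, :, 1])
    x2 = np.minimum(track_boxes[:, None, 2], det_boxes[None, :, 2])
    y2 = np.minimum(track_boxes[:, None, 3], det_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (track_boxes[:, 2] - track_boxes[:, 0]) * (track_boxes[:, 3] - track_boxes[:, 1])
    area_d = (det_boxes[:, 2] - det_boxes[:, 0]) * (det_boxes[:, 3] - det_boxes[:, 1])
    return inter / (area_t[:, None] + area_d[None, :] - inter + 1e-9)


def associate(track_boxes, track_feats, det_boxes, det_feats,
              lambda_app=0.5, cost_thresh=0.7):
    """Joint motion + appearance similarity solved with the Hungarian algorithm.

    track_feats and det_feats are assumed L2-normalized appearance embeddings,
    and track_boxes are the motion-model-predicted boxes for the current frame.
    """
    motion_sim = iou_matrix(track_boxes, det_boxes)        # shape [T, D]
    appearance_sim = track_feats @ det_feats.T             # cosine similarity, shape [T, D]
    joint_sim = (1.0 - lambda_app) * motion_sim + lambda_app * appearance_sim
    cost = 1.0 - joint_sim
    rows, cols = linear_sum_assignment(cost)
    # Keep only sufficiently similar pairs; unmatched tracks/detections are handled elsewhere.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < cost_thresh]


def update_track_feature(track_feat, det_feat, det_score, base_momentum=0.9):
    """Adaptive EMA feature update: more confident detections contribute more."""
    momentum = base_momentum + (1.0 - base_momentum) * (1.0 - det_score)
    new_feat = momentum * track_feat + (1.0 - momentum) * det_feat
    return new_feat / (np.linalg.norm(new_feat) + 1e-9)

In practice the appearance embeddings would come from the tracker's open-vocabulary appearance head and the predicted track boxes from a per-track motion model (commonly a Kalman filter); both are assumed inputs here rather than reproduced from the paper.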
Source Journal
IEEE Transactions on Multimedia (Engineering & Technology - Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months
Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.