Run Li;Dawei Zhang;Yanchao Wang;Yunliang Jiang;Zhonglong Zheng;Sang-Woon Jeon;Hua Wang
IEEE Transactions on Multimedia, vol. 27, pp. 3009-3022, published 2025-04-03. DOI: 10.1109/TMM.2025.3557619. URL: https://ieeexplore.ieee.org/document/10948331/. Impact factor: 8.4; JCR Q1 (Computer Science, Information Systems).
Open-Vocabulary Multi-Object Tracking With Domain Generalized and Temporally Adaptive Features
Open-vocabulary multi-object tracking (OVMOT) is a cutting-edge research direction within the multi-object tracking field. It employs large multi-modal models to address the challenge of tracking unseen objects within dynamic visual scenes. While such models require robust domain generalization and temporal adaptability, OVTrack, the only existing open-vocabulary multi-object tracker, relies solely on static appearance information and lacks these adaptive capabilities. In this paper, we propose OVSORT, a new framework designed to improve domain generalization and temporal information processing. Specifically, we first propose the Adaptive Contextual Normalization (ACN) technique in OVSORT, which dynamically adjusts the feature maps based on the dataset's statistical properties, thereby fine-tuning our model to improve domain generalization. We then introduce motion cues into OVMOT for the first time. Using our Joint Motion and Appearance Tracking (JMAT) strategy, we obtain a joint similarity measure and subsequently apply the Hungarian algorithm for data association. Finally, our Hierarchical Adaptive Feature Update (HAFU) strategy adjusts feature updates according to the current state of each trajectory, which greatly improves the utilization of temporal information. Extensive experiments on the TAO validation and test sets confirm the superiority of OVSORT, which significantly improves the handling of both novel and base classes. It surpasses existing methods in accuracy and generalization, setting a new state of the art for OVMOT.
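The abstract does not give the JMAT formula, so the following is only a minimal sketch of the general idea of combining a motion cue with an appearance cue and then solving the assignment problem. Everything in it is an assumption for illustration: the use of box IoU as the motion cue, cosine similarity as the appearance cue, the weight `alpha`, the threshold `min_sim`, and the function names. To stay dependency-free it finds the optimal assignment by brute force, which returns the same matching the Hungarian algorithm would on small inputs; in practice a routine such as `scipy.optimize.linear_sum_assignment` would be used instead.

```python
import math
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def joint_similarity(track_boxes, track_feats, det_boxes, det_feats, alpha=0.5):
    """Hypothetical joint score: alpha * motion (IoU) + (1 - alpha) * appearance."""
    return [
        [alpha * iou(tb, db) + (1 - alpha) * cosine(tf, df)
         for db, df in zip(det_boxes, det_feats)]
        for tb, tf in zip(track_boxes, track_feats)
    ]


def associate(sim, min_sim=0.3):
    """Return (track, detection) pairs that maximise total similarity.

    Brute force over all one-to-one assignments; only suitable for tiny inputs.
    Pairs whose similarity falls below min_sim are dropped after matching.
    """
    n_t = len(sim)
    n_d = len(sim[0]) if sim else 0
    if n_t == 0 or n_d == 0:
        return []
    if n_t <= n_d:
        candidates = (list(enumerate(p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = ([(t, d) for d, t in enumerate(p)]
                      for p in permutations(range(n_t), n_d))
    best_total, best_pairs = float("-inf"), []
    for pairs in candidates:
        total = sum(sim[t][d] for t, d in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return [(t, d) for t, d in best_pairs if sim[t][d] >= min_sim]


# Toy example: two tracks, three detections (the third matches nothing well).
tracks_b = [(0, 0, 10, 10), (20, 20, 30, 30)]
tracks_f = [(1.0, 0.0), (0.0, 1.0)]
dets_b = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
dets_f = [(0.1, 0.9), (0.9, 0.1), (1.0, 1.0)]
matches = associate(joint_similarity(tracks_b, tracks_f, dets_b, dets_f))
print(matches)
```

In this toy setup each track is matched to the detection that overlaps it and has a similar feature vector, while the third detection stays unmatched and could start a new trajectory. How OVSORT actually weights or gates the two cues is described in the paper itself, not here.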
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.