SemTG-Track: Multimodal fine-grained semantic-unit temporal guidance for multi-object tracking
Kai Ren, Chuanping Hu, Hao Xi, Yongqiang Li, Jinhao Fan, Lihua Liu
Expert Systems with Applications, volume 289, Article 128359. Published 2025-05-29.
DOI: 10.1016/j.eswa.2025.128359
URL: https://www.sciencedirect.com/science/article/pii/S0957417425019785
Abstract
In multi-object tracking (MOT) tasks, maintaining long-term identity consistency of targets in complex scenes remains a challenging research problem. Traditional prediction methods based on visual appearance features and motion trajectories struggle to dynamically and continuously preserve the unique representation of targets in complex environments. This limitation leads to tracking drift and identity confusion when targets undergo occlusion, blurring, or changes in scene dynamics and motion patterns, significantly degrading tracking performance. To address this issue, we propose a novel approach, SemTG-Track, which links the same target through cross-modal semantic information. By integrating a vision-language model with a hybrid LoRA expert system, our method enhances tracking performance through fine-grained modality alignment and dynamic semantic matching strategies. The SemTG-Track framework consists of three core modules: Semantic-unit Temporal Completeness Generation (STCG), Heterogeneous Semantic Representation Alignment (HSRA), and Temporal Sampling and Dynamic Matching (TSDM). Specifically, the STCG module leverages a vision-language model to generate a rich semantic knowledge graph for targets, the HSRA module enhances the generalization capability of semantic units through a dual-domain expert semantic fusion mechanism, and the TSDM module improves the efficiency and accuracy of multi-object tracking via dynamic sampling and context-aware matching mechanisms. Experimental results demonstrate that the proposed method outperforms baseline approaches, achieving improvements of 2.0 and 4.1 percentage points in MOTA and HOTA, respectively, on the MOT17 dataset. On the MOT20 dataset, our method also achieves gains of 0.4 and 2.2 percentage points in MOTA and HOTA, respectively, validating the effectiveness of our approach.
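For readers less familiar with the reported metrics, the following is a minimal sketch of how MOTA is conventionally computed under the CLEAR MOT definition: one minus the ratio of accumulated errors (false negatives, false positives, and identity switches) to the total number of ground-truth objects. It is illustrative only and is not the paper's evaluation code; the `FrameErrors` class, the `mota` function, and the per-frame counts are hypothetical. HOTA, which additionally balances detection and association accuracy over several localization thresholds, is not covered here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FrameErrors:
    """Per-frame error counts (hypothetical container, for illustration only)."""
    false_negatives: int   # ground-truth objects with no matched prediction
    false_positives: int   # predictions with no matched ground-truth object
    id_switches: int       # matched objects whose assigned identity changed
    ground_truth: int      # number of ground-truth objects in the frame


def mota(frames: Iterable[FrameErrors]) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames."""
    frames = list(frames)
    total_gt = sum(f.ground_truth for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined when there are no ground-truth objects")
    total_errors = sum(
        f.false_negatives + f.false_positives + f.id_switches for f in frames
    )
    return 1.0 - total_errors / total_gt


# Made-up counts for two frames; these are not results from the paper.
example_frames = [
    FrameErrors(false_negatives=2, false_positives=1, id_switches=0, ground_truth=10),
    FrameErrors(false_negatives=1, false_positives=0, id_switches=1, ground_truth=10),
]
example_score = mota(example_frames)  # 1 - 5/20 = 0.75
```

Under this definition, a gain of 2.0 percentage points in MOTA means the error ratio (errors divided by ground-truth objects) drops by 0.02 in absolute terms. Because identity switches are typically far fewer than detection errors, MOTA is dominated by detection quality, which is one reason association-sensitive metrics such as HOTA are usually reported alongside it.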
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.