SemTG-Track: Multimodal fine-grained semantic-unit temporal guidance for multi-object tracking

IF 7.5 · Zone 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Expert Systems with Applications Pub Date : 2025-05-29 DOI:10.1016/j.eswa.2025.128359

Kai Ren, Chuanping Hu, Hao Xi, Yongqiang Li, Jinhao Fan, Lihua Liu

Volume 289, Article 128359 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0957417425019785

Citations: 0

Abstract

In multi-object tracking (MOT) tasks, maintaining long-term identity consistency of targets in complex scenes remains a challenging research problem. Traditional prediction methods based on visual appearance features and motion trajectories struggle to dynamically and continuously preserve the unique representation of targets in complex environments. This limitation leads to tracking drift and identity confusion when targets undergo occlusion, blurring, or changes in scene dynamics and motion patterns, significantly degrading tracking performance. To address this issue, we propose a novel approach, SemTG-Track, which links the same target through cross-modal semantic information. By integrating a vision-language model with a hybrid LoRA expert system, our method enhances tracking performance through fine-grained modality alignment and dynamic semantic matching strategies. The SemTG-Track framework consists of three core modules: Semantic-unit Temporal Completeness Generation (STCG), Heterogeneous Semantic Representation Alignment (HSRA), and Temporal Sampling and Dynamic Matching (TSDM). Specifically, the STCG module leverages a vision-language model to generate a rich semantic knowledge graph for targets, the HSRA module enhances the generalization capability of semantic units through a dual-domain expert semantic fusion mechanism, and the TSDM module improves the efficiency and accuracy of multi-object tracking via dynamic sampling and context-aware matching mechanisms. Experimental results demonstrate that the proposed method outperforms baseline approaches, achieving improvements of 2.0 and 4.1 percentage points in MOTA and HOTA, respectively, on the MOT17 dataset. On the MOT20 dataset, our method also achieves gains of 0.4 and 2.2 percentage points in MOTA and HOTA, respectively, validating the effectiveness of our approach.
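The reported gains are in MOTA and HOTA, so it may help to recall how these metrics are commonly defined. The sketch below uses the standard definitions, MOTA = 1 - (FN + FP + IDSW) / GT and, at a single localization threshold, HOTA = sqrt(DetA * AssA). It is not taken from the paper: the function names and all counts are hypothetical and serve only to show what a 2.0 percentage point MOTA gain looks like arithmetically.

```python
import math


def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """Standard MOTA: 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """HOTA at one localization threshold: geometric mean of detection and association accuracy.

    The full HOTA score averages this value over a range of thresholds.
    """
    if not (0.0 <= det_a <= 1.0 and 0.0 <= ass_a <= 1.0):
        raise ValueError("det_a and ass_a must lie in [0, 1]")
    return math.sqrt(det_a * ass_a)


# Hypothetical counts (NOT from the paper), chosen only to illustrate the arithmetic.
baseline = mota(false_negatives=2000, false_positives=1000, id_switches=100, num_gt=10000)
improved = mota(false_negatives=1850, false_positives=950, id_switches=100, num_gt=10000)

# Difference expressed in percentage points, as in the abstract.
gain_pp = (improved - baseline) * 100
print(f"MOTA gain: {gain_pp:.1f} percentage points")
```

Because MOTA is dominated by detection errors while HOTA weighs detection and association equally, a larger HOTA gain than MOTA gain, as the abstract reports, is consistent with an improvement that mainly affects identity association.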


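Source journal: Expert Systems with Applications (Engineering & Technology, Engineering: Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months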

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.