{"title":"Cross-Lingual Adaptation for Vision-Language Model via Multimodal Semantic Distillation","authors":"Yu Weng;Wenbin He;Jun Dong;Chaomurilige;Xuan Liu;Zheng Liu","doi":"10.1109/TMM.2025.3557678","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557678","url":null,"abstract":"Large Multimodal Models (LMMs) excel in English multimedia tasks but face challenges in adapting to other languages due to linguistic diversity, limited non-English multimodal data, and high training costs. Existing approaches rely on machine-translated multimodal corpora or multilingual large language models, yet they demand substantial resources and achieve only modest zero-shot cross-lingual transfer performance, as shown in the IGLUE benchmark. In this work, we propose SMSA, a Syntax-aware Multimodal Semantic Adaptation approach, which efficiently extends vision-language models (VLMs) to multiple languages via a lightweight adaptation module. Instead of learning from scratch, SMSA transfers multimodal knowledge from English-trained models using two key components: (1) a Syntax-aware Adapter (SAA), which restructures multilingual text representations to align better with English syntax, reducing cross-lingual misalignment; (2) a Multimodal Semantic Distillation (MSD) method, which enables the model to mimic English sequence processing and retain multimodal associations across languages. This allows efficient adaptation to new languages while preserving the original model's strong multimodal capabilities. We extend an MoE-based VLM to 8 languages using a small translation dataset. Evaluations on the IGLUE benchmark show that SMSA achieves strong zero-shot transfer, outperforming some multilingual LMMs and demonstrating its effectiveness in cross-lingual vision-language adaptation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3184-3196"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Transfer From Image-Based Large Multimodal Models to Video Tasks","authors":"Shidong Cao;Zhonghan Zhao;Shengyu Hao;Wenhao Chai;Jenq-Neng Hwang;Hongwei Wang;Gaoang Wang","doi":"10.1109/TMM.2025.3557692","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557692","url":null,"abstract":"Extending image-based Large Multimodal Models (LMMs) to video-based LMMs always requires temporal modeling in the pre-training. However, training the temporal modules gradually erases the knowledge of visual features learned from various image-text-based scenarios, leading to degradation in some downstream tasks. To address this issue, in this paper, we introduce a novel, efficient transfer approach termed MTransLLAMA, which employs transfer learning from pre-trained image LMMs for fine-grained video tasks with only small-scale training sets. Our method enables <bold>fewer trainable parameters</b> and achieves <bold>faster adaptation</b> and <bold>higher accuracy</b> than pre-training video-based LMM models. Specifically, our method adopts early fusion between textual and visual features to capture fine-grained information, reuses spatial attention weights in temporal attentions for cyclical spatial-temporal reasoning, and introduces dynamic attention routing to capture both global and local information in spatial-temporal attentions. Experiments demonstrate that across multiple datasets and tasks, without relying on video pre-training, our model achieves state-of-the-art performance, enabling lightweight and efficient transfer from image-based LMMs to fine-grained video tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3045-3056"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HA-FGOVD: Highlighting Fine-Grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection","authors":"Yuqi Ma;Mengyin Liu;Chao Zhu;Xu-Cheng Yin","doi":"10.1109/TMM.2025.3557624","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557624","url":null,"abstract":"Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. Despite being pretrained on large-scale image-text pairs with rich attribute information, their latent feature space does not highlight these fine-grained attributes. In this paper, we introduce HA-FGOVD, a universal and explicit method that enhances the attribute-level detection capabilities of frozen OVD models by highlighting fine-grained attributes in explicit linear space. Our approach uses a LLM to extract attribute words in input text as a zero-shot task. Then, token attention masks are adjusted to guide text encoders in extracting both global and attribute-specific features, which are explicitly composited as two vectors in linear space to form a new attribute-highlighted feature for detection tasks. The composition weight scalars can be learned or transferred across different OVD models, showcasing the universality of our method. Experimental results show that HA-FGOVD achieves state-of-the-art performance on the FG-OVD benchmark and demonstrates promising generalization on the OVDEval benchmark, suggesting that our method addresses significant limitations in fine-grained attribute detection and has potential for broader fine-grained detection applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3171-3183"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Open-Vocabulary Multi-Object Tracking With Domain Generalized and Temporally Adaptive Features","authors":"Run Li;Dawei Zhang;Yanchao Wang;Yunliang Jiang;Zhonglong Zheng;Sang-Woon Jeon;Hua Wang","doi":"10.1109/TMM.2025.3557619","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557619","url":null,"abstract":"Open-vocabulary multi-object tracking (OVMOT) is a cutting research direction within the multi-object tracking field. It employs large multi-modal models to effectively address the challenge of tracking unseen objects within dynamic visual scenes. While models require robust domain generalization and temporal adaptability, OVTrack, the only existing open-vocabulary multi-object tracker, relies solely on static appearance information and lacks these crucial adaptive capabilities. In this paper, we propose OVSORT, a new framework designed to improve domain generalization and temporal information processing. Specifically, we first propose the Adaptive Contextual Normalization (ACN) technique in OVSORT, which dynamically adjusts the feature maps based on the dataset's statistical properties, thereby fine-tuning our model's to improve domain generalization. Then, we introduce motion cues for the first time. Using our Joint Motion and Appearance Tracking (JMAT) strategy, we obtain a joint similarity measure and subsequently apply the Hungarian algorithm for data association. Finally, our Hierarchical Adaptive Feature Update (HAFU) strategy adaptively adjusts feature updates according to the current state of each trajectory, which greatly improves the utilization of temporal information. Extensive experiments on the TAO validation set and test set confirm the superiority of OVSORT, which significantly improves the handling of novel and base classes. It surpasses existing methods in terms of accuracy and generalization, setting a new state-of-the-art for OVMOT.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3009-3022"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144177411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-View User Preference Modeling for Personalized Text-to-Image Generation","authors":"Huaiwen Zhang;Tianci Wu;Yinwei Wei","doi":"10.1109/TMM.2025.3557683","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557683","url":null,"abstract":"Personalized text-to-image generation aims to synthesize images tailored to individual user preferences. Existing methods primarily generate customized content using a few reference images, which often struggle to mine user preferences from historical records, and thus fail to synthesize truly personalized content. In addition, it is difficult to directly incorporate the extracted feature of user preferences into the feature space of the generation model, since there exists a considerable gap between them. In this paper, we propose a novel multi-view personalized text-to-image generation method based on the diffusion model, named MVP-Diffusion, which learns instance- and user-level preferences from historical records and integrates them into the generation model. For instance-level user preference modeling, we employ a chain-of-thought prompting strategy to deduce preference keywords and integrate them into input prompts with the aid of a large language model. For user-level preference modeling, we construct a learnable embedding for each user to capture more comprehensive preferences by analyzing their historical records. An adaptive user preference fusion module is proposed to inject user preferences into the generation model via a set of learnable parameters. Experimental results demonstrate that the proposed method significantly enhances the personalization of the generated images compared to the other personalized text-to-image generation methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3082-3091"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144177413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Target Pose Estimation and Behavior Analysis Based on Symmetric Cascaded AdderNet","authors":"Xiaoshuo Jia;Qingzhen Xu;Aiqing Zhu;Xiaomei Kuang","doi":"10.1109/TMM.2025.3557614","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557614","url":null,"abstract":"In the tasks of pose estimation and behavior analysis in computer vision, conventional models are often constrained by various factors or complex environments (such as multiple targets, small targets, occluded targets, etc.). To address this problem, this paper proposes a symmetric cascaded additive network (MulAG) to improve the accuracy of posture estimation and behavior analysis in complex environments. MulAG consists of two modules, MulA and MulG. The MulA module is designed based on a cascaded symmetric network structure and incorporates the addition operation. MulA extracts the posture spatial features of the target from a single frame image. And, the MulG module is designed based on three continuous GRUs (gated recurrent unit). Based on the MulA, MulG extracts the posture temporal features from the posture spatial features of the moving target and predicts the posture temporal features of the moving target. The paper firstly demonstrates the feasibility of addition operations in pose estimation tasks by comparing with MobileNet-v3 in ablation experiments. Secondly, on the HiEve and CrowdPose datasets, MulA achieves accuracy of 79.6% and 80.4%, respectively, outperforming the PTM model by 12.0% and 21.2%. Detection speed of MulA achieves the best value at 8.6 ms, which is 1 times higher than HDGCN. The result demonstrates the effectiveness of MulA in multi-target pose estimation in complex scenes. Finally, on the HDMB-51 and UCF-101 datasets, MulAG achieves accuracy of 74.8% and 86.3%, respectively, outperforming HDGCN by 9.6% and 9.5%. Compared with SKP and GIST, the fps of MulAG (44.8 s<sup>−1</sup>) is improved by 8.2% and 8.9%. These experiments highlight the generalizability and superiority of MulAG in behavior analysis and pose estimation tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3197-3209"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous Bijection Supervised Pyramid Diffeomorphic Deformation for Learning Tooth Meshes From CBCT Images","authors":"Zechu Zhang;Weilong Peng;Jinyu Wen;Keke Tang;Meie Fang;David Dagan Feng;Ping Li","doi":"10.1109/TMM.2025.3543091","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543091","url":null,"abstract":"Accurate and high-quality tooth mesh generation from cone-beam computerized tomography (CBCT) is an essential computer-aided technology for digital dentistry. However, existing segmentation-based methods require complicated post-processing and significant manual correction to generate regular tooth meshes. In this paper, we propose a method of continuous bijection supervised pyramid diffeomorphic deformation (PDD) for learning tooth meshes, which could be used to directly generate high-quality tooth meshes from CBCT Images. Overall, we adopt a classic two-stage framework. In the first stage, we devise an enhanced detector to accurately locate and crop every tooth. In the second stage, a PDD network is designed to deform a sphere mesh from low resolution to high one according to pyramid flows based on diffeomorphic mesh deformations, so that the generated mesh approximates the ground truth infinitely and efficiently. To achieve that, a novel continuous bijection distance loss on the diffeomorphic sphere is also designed to supervise the deformation learning, which overcomes the shortcoming of loss based on nearest-neighbour mapping and improves the fitting precision. Experiments show that our method outperforms the state-of-the-art methods in terms of both different evaluation metrics and the geometry quality of reconstructed tooth surfaces.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5696-5708"},"PeriodicalIF":9.7,"publicationDate":"2025-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secure Neural Network Watermarking Protocol Against Evidence Exposure Attack","authors":"Huixin Luo;Li Li;Xinpeng Zhang","doi":"10.1109/TMM.2025.3542975","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542975","url":null,"abstract":"Trigger-based backdoor watermarking is an extensively utilized and effective method to safeguard the copyright of deep neural networks (DNNs), in which the trigger set could be taken as the key of the watermark. However, during the verification stage, there is a risk that the trigger set could be leaked and exposed to adversaries. If this occurs, the adversaries might apply this leaked trigger set to claim ownership of the model, posing significant copyright issues for the watermarked DNN. To address such an evidence exposure problem, a secure neural network watermarking protocol is put forward in this paper. In the proposed protocol, the trigger set is not fixed, once the trigger is utilized for verification, it is invalid and cannot be used for verification in the future. As a result, even if the trigger set is leaked during the verification process and obtained by the attacker, they cannot use it for copyright verification since it is invalid. To assist the protocol, a trigger set generation method is designed, in which the auxiliary classifier generative adversarial network (ACGAN) and the target classification model are trained together. The special logits distribution and the labels of the generated trigger samples can be ensured and verified effectively in this way. The performance of the trigger generation methods regarding effectiveness, fidelity, and robustness is verified by experiments, and the security analysis of the designed watermarking protocol is conducted.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5563-5574"},"PeriodicalIF":9.7,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Error-Aware Generative Reasoning for Zero-Shot Visual Grounding","authors":"Yuqi Bu;Xin Wu;Yi Cai;Qiong Liu;Tao Wang;Qingbao Huang","doi":"10.1109/TMM.2025.3543062","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543062","url":null,"abstract":"Zero-shot visual grounding is the task of identifying and localizing an object in an image based on a referring expression without task-specific training. Existing methods employ heuristic rules to step-by-step perform visual perception for visual grounding. Despite their remarkable performance, there are still two limitations. First, such a rule-based manner struggles with expressions that are not covered by predefined rules. Second, existing methods lack a mechanism for identifying and correcting visual perceptual errors of incomplete information, resulting in cascading errors caused by reasoning based on incomplete visual perception results. In this article, we propose an Error-Aware Generative Reasoning (EAGR) method for zero-shot visual grounding. To address the limited adaptability of existing methods, a reasoning chain generator is presented, which prompts LLMs to dynamically generate reasoning chains for specific referring expressions. This generative manner eliminates the reliance on human-written heuristic rules. To mitigate visual perceptual errors of incomplete information, an error-aware mechanism is presented to elicit LLMs to identify these errors and explore correction strategies. Experimental results on four benchmarks show that EAGR outperforms state-of-the-art zero-shot methods by up to 10% and an average of 7%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4844-4855"},"PeriodicalIF":9.7,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}