IEEE Transactions on Multimedia最新文献_第3页

ExpLLM: Towards Chain of Thought for Facial Expression Recognition 面向面部表情识别的思维链

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557704

Xing Lan;Jian Xue;Ji Qi;Dongmei Jiang;Ke Lu;Tat-Seng Chua

{"title":"ExpLLM: Towards Chain of Thought for Facial Expression Recognition","authors":"Xing Lan;Jian Xue;Ji Qi;Dongmei Jiang;Ke Lu;Tat-Seng Chua","doi":"10.1109/TMM.2025.3557704","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557704","url":null,"abstract":"Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3069-3081"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Cross-Lingual Adaptation for Vision-Language Model via Multimodal Semantic Distillation 基于多模态语义蒸馏的视觉语言模型跨语言自适应

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557678

Yu Weng;Wenbin He;Jun Dong;Chaomurilige;Xuan Liu;Zheng Liu

{"title":"Cross-Lingual Adaptation for Vision-Language Model via Multimodal Semantic Distillation","authors":"Yu Weng;Wenbin He;Jun Dong;Chaomurilige;Xuan Liu;Zheng Liu","doi":"10.1109/TMM.2025.3557678","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557678","url":null,"abstract":"Large Multimodal Models (LMMs) excel in English multimedia tasks but face challenges in adapting to other languages due to linguistic diversity, limited non-English multimodal data, and high training costs. Existing approaches rely on machine-translated multimodal corpora or multilingual large language models, yet they demand substantial resources and achieve only modest zero-shot cross-lingual transfer performance, as shown in the IGLUE benchmark. In this work, we propose SMSA, a Syntax-aware Multimodal Semantic Adaptation approach, which efficiently extends vision-language models (VLMs) to multiple languages via a lightweight adaptation module. Instead of learning from scratch, SMSA transfers multimodal knowledge from English-trained models using two key components: (1) a Syntax-aware Adapter (SAA), which restructures multilingual text representations to align better with English syntax, reducing cross-lingual misalignment; (2) a Multimodal Semantic Distillation (MSD) method, which enables the model to mimic English sequence processing and retain multimodal associations across languages. This allows efficient adaptation to new languages while preserving the original model's strong multimodal capabilities. We extend an MoE-based VLM to 8 languages using a small translation dataset. Evaluations on the IGLUE benchmark show that SMSA achieves strong zero-shot transfer, outperforming some multilingual LMMs and demonstrating its effectiveness in cross-lingual vision-language adaptation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3184-3196"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Efficient Transfer From Image-Based Large Multimodal Models to Video Tasks 从基于图像的大型多模态模型到视频任务的有效转换

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557692

Shidong Cao;Zhonghan Zhao;Shengyu Hao;Wenhao Chai;Jenq-Neng Hwang;Hongwei Wang;Gaoang Wang

{"title":"Efficient Transfer From Image-Based Large Multimodal Models to Video Tasks","authors":"Shidong Cao;Zhonghan Zhao;Shengyu Hao;Wenhao Chai;Jenq-Neng Hwang;Hongwei Wang;Gaoang Wang","doi":"10.1109/TMM.2025.3557692","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557692","url":null,"abstract":"Extending image-based Large Multimodal Models (LMMs) to video-based LMMs always requires temporal modeling in the pre-training. However, training the temporal modules gradually erases the knowledge of visual features learned from various image-text-based scenarios, leading to degradation in some downstream tasks. To address this issue, in this paper, we introduce a novel, efficient transfer approach termed MTransLLAMA, which employs transfer learning from pre-trained image LMMs for fine-grained video tasks with only small-scale training sets. Our method enables <bold>fewer trainable parameters and achieves <bold>faster adaptation and <bold>higher accuracy than pre-training video-based LMM models. Specifically, our method adopts early fusion between textual and visual features to capture fine-grained information, reuses spatial attention weights in temporal attentions for cyclical spatial-temporal reasoning, and introduces dynamic attention routing to capture both global and local information in spatial-temporal attentions. Experiments demonstrate that across multiple datasets and tasks, without relying on video pre-training, our model achieves state-of-the-art performance, enabling lightweight and efficient transfer from image-based LMMs to fine-grained video tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3045-3056"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

HA-FGOVD: Highlighting Fine-Grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection HA-FGOVD：通过显式线性组合突出细粒度属性用于开放词汇对象检测

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557624

Yuqi Ma;Mengyin Liu;Chao Zhu;Xu-Cheng Yin

{"title":"HA-FGOVD: Highlighting Fine-Grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection","authors":"Yuqi Ma;Mengyin Liu;Chao Zhu;Xu-Cheng Yin","doi":"10.1109/TMM.2025.3557624","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557624","url":null,"abstract":"Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. Despite being pretrained on large-scale image-text pairs with rich attribute information, their latent feature space does not highlight these fine-grained attributes. In this paper, we introduce HA-FGOVD, a universal and explicit method that enhances the attribute-level detection capabilities of frozen OVD models by highlighting fine-grained attributes in explicit linear space. Our approach uses a LLM to extract attribute words in input text as a zero-shot task. Then, token attention masks are adjusted to guide text encoders in extracting both global and attribute-specific features, which are explicitly composited as two vectors in linear space to form a new attribute-highlighted feature for detection tasks. The composition weight scalars can be learned or transferred across different OVD models, showcasing the universality of our method. Experimental results show that HA-FGOVD achieves state-of-the-art performance on the FG-OVD benchmark and demonstrates promising generalization on the OVDEval benchmark, suggesting that our method addresses significant limitations in fine-grained attribute detection and has potential for broader fine-grained detection applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3171-3183"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Open-Vocabulary Video Semantic Segmentation 面向开放词汇的视频语义分割

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557719

Xinhao Li;Yun Liu;Guolei Sun;Min Wu;Le Zhang;Ce Zhu

{"title":"Towards Open-Vocabulary Video Semantic Segmentation","authors":"Xinhao Li;Yun Liu;Guolei Sun;Min Wu;Le Zhang;Ce Zhu","doi":"10.1109/TMM.2025.3557719","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557719","url":null,"abstract":"Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV-VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2924-2934"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Adaptive Framework Embedded With LLM for Knowledge Graph Construction 一种嵌入LLM的自适应知识图谱构建框架

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557717

Qingwang Wang;Chaohui Li;Yi Liu;Qiubai Zhu;Jian Song;Tao Shen

{"title":"An Adaptive Framework Embedded With LLM for Knowledge Graph Construction","authors":"Qingwang Wang;Chaohui Li;Yi Liu;Qiubai Zhu;Jian Song;Tao Shen","doi":"10.1109/TMM.2025.3557717","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557717","url":null,"abstract":"Knowledge graph construction is aimed at storing and representing the knowledge of the objective world in a structured form. Existing methods for automatic construction of knowledge graphs have problems such as difficulty in understanding potential semantics and low precision. The emergence of Large Language Models (LLMs) provides an effective way for automatic knowledge graph construction. However, using LLMs as automatic knowledge graph construction engines relies on the embedding of schema layers, which brings challenges to the input length of LLMs. In this paper, we present a framework for Adaptive Construction of Knowledge Graph by leveraging the exceptional generation capabilities of LLMs and the latent relational semantic information of triples, named ACKG-LLM. Our proposed framework divides the knowledge graph construction task into three subtasks within a unified pipeline: triple extraction of open information, additional relational semantic information embedding and knowledge graph normalization based on schema-level embedding. The framework can construct knowledge graphs in different domains, making up for the defects of existing frameworks that need to retrain and fine-tune the internal model. Extensive experiments demonstrate that our proposed ACKG-LLM performs favorably against representative methods on the REBEL and WiKi-NRE datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2912-2923"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Open-Vocabulary Multi-Object Tracking With Domain Generalized and Temporally Adaptive Features 具有领域广义和时间自适应特征的开放词汇多目标跟踪

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557619

Run Li;Dawei Zhang;Yanchao Wang;Yunliang Jiang;Zhonglong Zheng;Sang-Woon Jeon;Hua Wang

{"title":"Open-Vocabulary Multi-Object Tracking With Domain Generalized and Temporally Adaptive Features","authors":"Run Li;Dawei Zhang;Yanchao Wang;Yunliang Jiang;Zhonglong Zheng;Sang-Woon Jeon;Hua Wang","doi":"10.1109/TMM.2025.3557619","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557619","url":null,"abstract":"Open-vocabulary multi-object tracking (OVMOT) is a cutting research direction within the multi-object tracking field. It employs large multi-modal models to effectively address the challenge of tracking unseen objects within dynamic visual scenes. While models require robust domain generalization and temporal adaptability, OVTrack, the only existing open-vocabulary multi-object tracker, relies solely on static appearance information and lacks these crucial adaptive capabilities. In this paper, we propose OVSORT, a new framework designed to improve domain generalization and temporal information processing. Specifically, we first propose the Adaptive Contextual Normalization (ACN) technique in OVSORT, which dynamically adjusts the feature maps based on the dataset's statistical properties, thereby fine-tuning our model's to improve domain generalization. Then, we introduce motion cues for the first time. Using our Joint Motion and Appearance Tracking (JMAT) strategy, we obtain a joint similarity measure and subsequently apply the Hungarian algorithm for data association. Finally, our Hierarchical Adaptive Feature Update (HAFU) strategy adaptively adjusts feature updates according to the current state of each trajectory, which greatly improves the utilization of temporal information. Extensive experiments on the TAO validation set and test set confirm the superiority of OVSORT, which significantly improves the handling of novel and base classes. It surpasses existing methods in terms of accuracy and generalization, setting a new state-of-the-art for OVMOT.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3009-3022"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144177411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi-View User Preference Modeling for Personalized Text-to-Image Generation 个性化文本到图像生成的多视图用户偏好建模

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557683

Huaiwen Zhang;Tianci Wu;Yinwei Wei

{"title":"Multi-View User Preference Modeling for Personalized Text-to-Image Generation","authors":"Huaiwen Zhang;Tianci Wu;Yinwei Wei","doi":"10.1109/TMM.2025.3557683","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557683","url":null,"abstract":"Personalized text-to-image generation aims to synthesize images tailored to individual user preferences. Existing methods primarily generate customized content using a few reference images, which often struggle to mine user preferences from historical records, and thus fail to synthesize truly personalized content. In addition, it is difficult to directly incorporate the extracted feature of user preferences into the feature space of the generation model, since there exists a considerable gap between them. In this paper, we propose a novel multi-view personalized text-to-image generation method based on the diffusion model, named MVP-Diffusion, which learns instance- and user-level preferences from historical records and integrates them into the generation model. For instance-level user preference modeling, we employ a chain-of-thought prompting strategy to deduce preference keywords and integrate them into input prompts with the aid of a large language model. For user-level preference modeling, we construct a learnable embedding for each user to capture more comprehensive preferences by analyzing their historical records. An adaptive user preference fusion module is proposed to inject user preferences into the generation model via a set of learnable parameters. Experimental results demonstrate that the proposed method significantly enhances the personalization of the generated images compared to the other personalized text-to-image generation methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3082-3091"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144177413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi-Target Pose Estimation and Behavior Analysis Based on Symmetric Cascaded AdderNet 基于对称级联AdderNet的多目标姿态估计与行为分析

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557614

Xiaoshuo Jia;Qingzhen Xu;Aiqing Zhu;Xiaomei Kuang

{"title":"Multi-Target Pose Estimation and Behavior Analysis Based on Symmetric Cascaded AdderNet","authors":"Xiaoshuo Jia;Qingzhen Xu;Aiqing Zhu;Xiaomei Kuang","doi":"10.1109/TMM.2025.3557614","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557614","url":null,"abstract":"In the tasks of pose estimation and behavior analysis in computer vision, conventional models are often constrained by various factors or complex environments (such as multiple targets, small targets, occluded targets, etc.). To address this problem, this paper proposes a symmetric cascaded additive network (MulAG) to improve the accuracy of posture estimation and behavior analysis in complex environments. MulAG consists of two modules, MulA and MulG. The MulA module is designed based on a cascaded symmetric network structure and incorporates the addition operation. MulA extracts the posture spatial features of the target from a single frame image. And, the MulG module is designed based on three continuous GRUs (gated recurrent unit). Based on the MulA, MulG extracts the posture temporal features from the posture spatial features of the moving target and predicts the posture temporal features of the moving target. The paper firstly demonstrates the feasibility of addition operations in pose estimation tasks by comparing with MobileNet-v3 in ablation experiments. Secondly, on the HiEve and CrowdPose datasets, MulA achieves accuracy of 79.6% and 80.4%, respectively, outperforming the PTM model by 12.0% and 21.2%. Detection speed of MulA achieves the best value at 8.6 ms, which is 1 times higher than HDGCN. The result demonstrates the effectiveness of MulA in multi-target pose estimation in complex scenes. Finally, on the HDMB-51 and UCF-101 datasets, MulAG achieves accuracy of 74.8% and 86.3%, respectively, outperforming HDGCN by 9.6% and 9.5%. Compared with SKP and GIST, the fps of MulAG (44.8 s−1) is improved by 8.2% and 8.9%. These experiments highlight the generalizability and superiority of MulAG in behavior analysis and pose estimation tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3197-3209"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Guest Editorial: When Multimedia Meets Food: Multimedia Computing for Food Data Analysis and Applications 客座评论：当多媒体与食品相遇：食品数据分析和应用的多媒体计算

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-03-28 DOI: 10.1109/TMM.2025.3566452

Weiqing Min;Shuqiang Jiang;Petia Radeva;Vladimir Pavlovic;Chong-Wah Ngo;Kiyoharu Aizawa;Wanqing Li

引用次数: 0