IEEE Transactions on Multimedia最新文献_第5页

Leveraging Concise Concepts With Probabilistic Modeling for Interpretable Visual Recognition 利用简洁的概念和概率模型进行可解释的视觉识别

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557677

Yixuan Zhang;Chuanbin Liu;Yizhi Liu;Yifan Gao;Zhiying Lu;Hongtao Xie;Yongdong Zhang

{"title":"Leveraging Concise Concepts With Probabilistic Modeling for Interpretable Visual Recognition","authors":"Yixuan Zhang;Chuanbin Liu;Yizhi Liu;Yifan Gao;Zhiying Lu;Hongtao Xie;Yongdong Zhang","doi":"10.1109/TMM.2025.3557677","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557677","url":null,"abstract":"Interpretable visual recognition is essential for decision-making in high-stakes situations. Recent advancements have automated the construction of interpretable models by leveraging Visual Language Models (VLMs) and Large Language Models (LLMs) with Concept Bottleneck Models (CBMs), which process a bottleneck layer associated with human-understandable concepts. However, existing methods suffer from two main problems: a) the collected concepts from LLMs could be redundant with task-irrelevant descriptions, resulting in an inferior concept space with potential mismatch. b) VLMs directly map the global deterministic image embeddings with fine-grained concepts results in an ambiguous process with imprecise mapping results. To address the above two issues, we propose a novel solution for CBMs with Concise Concept and Probabilistic Modeling (CCPM) that can achieve superior classification performance via high-quality concepts and precise mapping strategy. First, we leverage in-context examples as category-related clues to guide LLM concept generation process. To mitigate redundancy in the concept space, we propose a Relation-Aware Selection (RAS) module to obtain a concise concept set that is discriminative and relevant based on image-concept and inter-concept relationships. Second, for precise mapping, we employ a Probabilistic Distribution Adapter (PDA) that estimates the inherent ambiguity of the image embeddings of pre-trained VLMs to capture the complex relationships with concepts. Extensive experiments indicate that our model achieves state-of-the-art results with a 6.18% improvement in classification accuracy on eight mainstream recognition benchmarks as well as reliable explainability through interpretable analysis.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3117-3131"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Adversarial Geometric Attacks for 3D Point Cloud Object Tracking 三维点云目标跟踪的对抗性几何攻击

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557613

Rui Yao;Anqi Zhang;Yong Zhou;Jiaqi Zhao;Bing Liu;Abdulmotaleb El Saddik

{"title":"Adversarial Geometric Attacks for 3D Point Cloud Object Tracking","authors":"Rui Yao;Anqi Zhang;Yong Zhou;Jiaqi Zhao;Bing Liu;Abdulmotaleb El Saddik","doi":"10.1109/TMM.2025.3557613","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557613","url":null,"abstract":"3D point cloud object tracking (3D PCOT) plays a vital role in applications such as autonomous driving and robotics. Adversarial attacks offer a promising approach to enhance the robustness and security of tracking models. However, existing adversarial attack methods for 3D PCOT seldom leverage the geometric structure of point clouds and often overlook the transferability of attack strategies. To address these limitations, this paper proposes an adversarial geometric attack method tailored for 3D PCOT, which includes a point perturbation attack module (non-isometric transformation) and a rotation attack module (isometric transformation). First, we introduce a curvature-aware point perturbation attack module that enhances local transformations by applying normal perturbations to critical points identified through geometric features such as curvature and entropy. Second, we design a Thompson sampling-based rotation attack module that applies subtle global rotations to the point cloud, introducing tracking errors while maintaining imperceptibility. Additionally, we design a fused loss function to iteratively optimize the point cloud within the search region, generating adversarially perturbed samples. The proposed method is evaluated on multiple 3D PCOT models and validated through black-box tracking experiments on benchmarks. For P2B, white-box attacks on KITTI reduce the success rate from 53.3% to 29.6% and precision from 68.4% to 37.1%. On NuScenes, the success rate drops from 39.0% to 27.6%, and precision from 39.9 to 26.8%. Black-box attacks show a transferability, with BAT showing a maximum 47.0% drop in success rate and 47.2% in precision on KITTI, and a maximum 22.5% and 27.0% on NuScenes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3144-3157"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144177412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

ICE: Interactive 3D Game Character Facial Editing via Dialogue ICE：通过对话进行交互式3D游戏角色面部编辑

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557611

Haoqian Wu;Minda Zhao;Zhipeng Hu;Changjie Fan;Lincheng Li;Weijie Chen;Rui Zhao;Xin Yu

{"title":"ICE: Interactive 3D Game Character Facial Editing via Dialogue","authors":"Haoqian Wu;Minda Zhao;Zhipeng Hu;Changjie Fan;Lincheng Li;Weijie Chen;Rui Zhao;Xin Yu","doi":"10.1109/TMM.2025.3557611","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557611","url":null,"abstract":"Most recent popular Role-Playing Games (RPGs) allow players to create in-game characters with hundreds of adjustable parameters, including bone positions and various makeup options. Although text-driven auto-customization systems have been developed to simplify the complex process of adjusting these intricate character parameters, they are limited by their single-round generation and lack the capability for further editing and fine-tuning. In this paper, we propose an Interactive Character Editing framework (ICE) to achieve a multi-round dialogue-based refinement process. In a nutshell, our ICE offers a more user-friendly way to enable players to convey creative ideas iteratively while ensuring that created characters align with the expectations of players. Specifically, we propose an Instruction Parsing Module (IPM) that utilizes large language models (LLMs) to parse multi-round dialogues into clear editing instruction prompts in each round. To reliably and swiftly modify character control parameters at a fine-grained level, we propose a Semantic-guided Low-dimension Parameter Solver (SLPS) that edits character control parameters according to prompts in a zero-shot manner. Our SLPS first localizes the character control parameters related to the fine-grained modification, and then optimizes the corresponding parameters in a low-dimension space to avoid unrealistic results. Extensive experimental results demonstrate the effectiveness of our proposed ICE for in-game character creation and the superior editing performance of ICE.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3210-3223"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Cross-Modality Prompts: Few-Shot Multi-Label Recognition With Single-Label Training 跨模态提示：单标签训练的少射多标签识别

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557700

Zixuan Ding;Zihan Zhou;Hui Chen;Tianxiang Hao;Yizhe Xiong;Sicheng Zhao;Qiang Zhang;Jungong Han

{"title":"Cross-Modality Prompts: Few-Shot Multi-Label Recognition With Single-Label Training","authors":"Zixuan Ding;Zihan Zhou;Hui Chen;Tianxiang Hao;Yizhe Xiong;Sicheng Zhao;Qiang Zhang;Jungong Han","doi":"10.1109/TMM.2025.3557700","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557700","url":null,"abstract":"Few-shot multi-label recognition (FS-MLR) presents a significant challenge due to the need to assign multiple labels to images with limited examples. Existing methods often struggle to balance the learning of novel classes and the retention of knowledge from base classes. To address this issue, we propose a novel Cross-Modality Prompts (CMP) approach. Unlike conventional methods that rely on additional semantic information to mitigate the impact of limited samples, our approach leverages multimodal prompts to adaptively tune the feature extraction network. A new FS-MLR benchmark is also proposed, which includes single-label training and multi-label testing, accompanied by benchmark datasets constructed from MS-COCO and NUS-WIDE. Extensive experiments on these datasets demonstrate the superior performance of our CMP approach, highlighting its effectiveness and adaptability. Our results show that CMP outperforms CoOp on the MS-COCO dataset with a maximal improvement of 19.47% and 23.94% in mAP<sub>harmonic</sub> for 5-way 1-shot and 5-way 5-shot settings, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3023-3033"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Segmenting Anything in the Dark via Depth Perception 通过深度感知在黑暗中分割任何东西

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557612

Peng Liu;Jinhong Deng;Lixin Duan;Wen Li;Fengmao Lv

{"title":"Segmenting Anything in the Dark via Depth Perception","authors":"Peng Liu;Jinhong Deng;Lixin Duan;Wen Li;Fengmao Lv","doi":"10.1109/TMM.2025.3557612","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557612","url":null,"abstract":"Image segmentation under low-light conditions is essential in real-world applications, such as autonomous driving and video surveillance systems. The recent Segment Anything Model (SAM) exhibits strong segmentation capability in various vision applications. However, its performance could be severely degraded under low-light conditions. On the other hand, multimodal information has been exploited to help models construct more comprehensive understanding of scenes under low-light conditions by providing complementary information (e.g., depth). Therefore, in this work, we present a pioneer attempt that elevates a unimodal vision foundation model (e.g., SAM) to a multimodal one, by efficiently integrating additional depth information under low-light conditions. To achieve that, we propose a novel method called Depth Perception SAM (DPSAM) based on the SAM framework. Specifically, we design a modality encoder to extract the depth information and the Depth Perception Layers (DPLs) for mutual feature refinement between RGB and depth features. The DPLs employ the cross-modal attention mechanism to mutually query effective information from both RGB and depth for the subsequent feature refinement. Thus, DPLs can effectively leverage the complementary information from depth to enrich the RGB representations and obtain comprehensive multimodal visual representations for segmenting anything in the dark. To this end, our DPSAM maximally maintains the instinct expertise of SAM for RGB image segmentation and further leverages on the strength of depth for enhanced segmenting anything capability, especially for cases that are likely to fail with RGB only (e.g., low-light or complex textures). As demonstrated by extensive experiments on four RGBD benchmark datasets, DPSAM clearly improves the performance for the segmenting anything performance in the dark, e.g., +12.90% mIoU and +16.23% mIoU on LLRGBD and DeLiVER, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2975-2986"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data 有限数据下三维兴趣区域标注的多模态自我感知增强大语言模型

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557703

Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen

{"title":"Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data","authors":"Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen","doi":"10.1109/TMM.2025.3557703","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557703","url":null,"abstract":"3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct a multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of RoI from vision-following response of LLMs. This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2935-2948"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Visual-Linguistic Feature Alignment With Semantic and Kinematic Guidance for Referring Multi-Object Tracking 基于语义和运动引导的多目标跟踪视觉语言特征对齐

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557710

Yizhe Li;Sanping Zhou;Zheng Qin;Le Wang

{"title":"Visual-Linguistic Feature Alignment With Semantic and Kinematic Guidance for Referring Multi-Object Tracking","authors":"Yizhe Li;Sanping Zhou;Zheng Qin;Le Wang","doi":"10.1109/TMM.2025.3557710","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557710","url":null,"abstract":"Referring Multi-Object Tracking (RMOT) aims to dynamically track an arbitrary number of referred targets in a video sequence according to the language expression. Previous methods mainly focus on cross-modal fusion at the feature level with designed structures. However, the insufficient visual-linguistic alignment is prone to causing visual-linguistic mismatches, leading to some targets being tracked but not correctly referred especially when facing the language expression with complex semantics or motion descriptions. To this end, we propose to conduct visual-linguistic alignment with semantic and kinematic guidance to effectively align the visual features with more diverse language expressions. In this paper, we put forward a novel end-to-end RMOT framework SKTrack, which follows the transformer-based architecture with a Language-Guided Decoder (LGD) and a Motion-Aware Aggregator (MAA). In particular, the LGD performs deep semantic interaction layer-by-layer in a single frame to enhance the alignment ability of the model, while the MAA conducts temporal feature fusion and alignment across multiple frames to enable the alignment between visual targets and language expression with motion descriptions. Extensive experiments on the Refer-KITTI and Refer-KITTI-v2 demonstrate that SKTrack achieves state-of-the-art performance and verify the effectiveness of our framework and its components.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3034-3044"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Open-Vocabulary Video Semantic Segmentation 面向开放词汇的视频语义分割

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557719

Xinhao Li;Yun Liu;Guolei Sun;Min Wu;Le Zhang;Ce Zhu

{"title":"Towards Open-Vocabulary Video Semantic Segmentation","authors":"Xinhao Li;Yun Liu;Guolei Sun;Min Wu;Le Zhang;Ce Zhu","doi":"10.1109/TMM.2025.3557719","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557719","url":null,"abstract":"Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV-VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2924-2934"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Adaptive Framework Embedded With LLM for Knowledge Graph Construction 一种嵌入LLM的自适应知识图谱构建框架

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557717

Qingwang Wang;Chaohui Li;Yi Liu;Qiubai Zhu;Jian Song;Tao Shen

{"title":"An Adaptive Framework Embedded With LLM for Knowledge Graph Construction","authors":"Qingwang Wang;Chaohui Li;Yi Liu;Qiubai Zhu;Jian Song;Tao Shen","doi":"10.1109/TMM.2025.3557717","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557717","url":null,"abstract":"Knowledge graph construction is aimed at storing and representing the knowledge of the objective world in a structured form. Existing methods for automatic construction of knowledge graphs have problems such as difficulty in understanding potential semantics and low precision. The emergence of Large Language Models (LLMs) provides an effective way for automatic knowledge graph construction. However, using LLMs as automatic knowledge graph construction engines relies on the embedding of schema layers, which brings challenges to the input length of LLMs. In this paper, we present a framework for Adaptive Construction of Knowledge Graph by leveraging the exceptional generation capabilities of LLMs and the latent relational semantic information of triples, named ACKG-LLM. Our proposed framework divides the knowledge graph construction task into three subtasks within a unified pipeline: triple extraction of open information, additional relational semantic information embedding and knowledge graph normalization based on schema-level embedding. The framework can construct knowledge graphs in different domains, making up for the defects of existing frameworks that need to retrain and fine-tune the internal model. Extensive experiments demonstrate that our proposed ACKG-LLM performs favorably against representative methods on the REBEL and WiKi-NRE datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2912-2923"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

ExpLLM: Towards Chain of Thought for Facial Expression Recognition 面向面部表情识别的思维链

IF 8.4 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI: 10.1109/TMM.2025.3557704

Xing Lan;Jian Xue;Ji Qi;Dongmei Jiang;Ke Lu;Tat-Seng Chua

{"title":"ExpLLM: Towards Chain of Thought for Facial Expression Recognition","authors":"Xing Lan;Jian Xue;Ji Qi;Dongmei Jiang;Ke Lu;Tat-Seng Chua","doi":"10.1109/TMM.2025.3557704","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557704","url":null,"abstract":"Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3069-3081"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0