{"title":"Few-Shot 3D Point Cloud Segmentation via Relation Consistency-Guided Heterogeneous Prototypes","authors":"Lili Wei;Congyan Lang;Zheming Xu;Liqian Liang;Jun Liu","doi":"10.1109/TMM.2025.3557699","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557699","url":null,"abstract":"Few-shot 3D point cloud semantic segmentation is a challenging task due to the lack of labeled point clouds (support set). To segment unlabeled query point clouds, existing prototype-based methods learn 3D prototypes from point features of the support set and then measure their distances to the query points. However, such homogeneous 3D prototypes are often of low quality because they overlook the valuable heterogeneous information buried in the support set, such as semantic labels and projected 2D depth maps. To address this issue, in this paper, we propose a novel <bold>R</b>elation <bold>C</b>onsistency-guided <bold>H</b>eterogeneous <bold>P</b>rototype learning framework (RCHP), which improves prototype quality by integrating heterogeneous information using large multi-modal models (<italic>e.g.</i> CLIP). RCHP achieves this through two core components: Heterogeneous Prototype Generation module which collaborates with 3D networks and CLIP to generate heterogeneous prototypes, and Heterogeneous Prototype Fusion module which effectively fuses heterogeneous prototypes to obtain high-quality prototypes. Furthermore, to bridge the gap between heterogeneous prototypes, we introduce a Heterogeneous Relation Consistency loss, which transfers more reliable inter-class relations (<italic>i.e.</i>, inter-prototype relations) from refined prototypes to heterogeneous ones. Extensive experiments conducted on five point cloud segmentation datasets, including four indoor datasets (S3DIS, ScanNet, SceneNN, NYU Depth V2) and one outdoor dataset (Semantic3D), demonstrate the superiority and generalization capability of our method, outperforming state-of-the-art approaches across all datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3158-3170"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RoG-SAM: A Language-Driven Framework for Instance-Level Robotic Grasping Detection","authors":"Yunpeng Mei;Jian Sun;Zhihong Peng;Fang Deng;Gang Wang;Jie Chen","doi":"10.1109/TMM.2025.3557685","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557685","url":null,"abstract":"Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun addressing instance-level grasping, most remain limited to predefined instances and categories, lacking flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on Segment Anything Model (SAM). RoG-SAM utilizes open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3057-3068"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation","authors":"Weijun Zhuang;Bowen Dong;Zhilin Zhu;Zhijun Li;Jie Liu;Yaowei Wang;Xiaopeng Hong;Xin Li;Wangmeng Zuo","doi":"10.1109/TMM.2025.3557688","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557688","url":null,"abstract":"Accurately detecting objects and their interrelationships for Video Scene Graph Generation (VidSGG) confronts two primary challenges. The first involves the identification of active objects interacting with humans from the numerous background objects, while the second challenge is long-tailed distribution among predicate classes. To tackle these challenges, we propose STABILE, a novel framework with a spatial-temporal saliency-guided contrastive learning scheme. For the first challenge, STABILE features an active object retriever that includes an object saliency fusion block for enhancing object embeddings with motion cues alongside an object temporal encoder to capture temporal dependencies. For the second challenge, STABILE introduces an unbiased relationship representation learning module with an Unbiased Multi-Label (UML) contrastive loss to mitigate the effect of long-tailed distribution. With the enhancements in both aspects, STABILE substantially boosts the accuracy of scene graph generation. Extensive experiments demonstrate the superiority of STABILE, setting new benchmarks in the field by offering enhanced accuracy and unbiased scene graph generation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3092-3104"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Concise Concepts With Probabilistic Modeling for Interpretable Visual Recognition","authors":"Yixuan Zhang;Chuanbin Liu;Yizhi Liu;Yifan Gao;Zhiying Lu;Hongtao Xie;Yongdong Zhang","doi":"10.1109/TMM.2025.3557677","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557677","url":null,"abstract":"Interpretable visual recognition is essential for decision-making in high-stakes situations. Recent advancements have automated the construction of interpretable models by leveraging Visual Language Models (VLMs) and Large Language Models (LLMs) with Concept Bottleneck Models (CBMs), which process a bottleneck layer associated with human-understandable concepts. However, existing methods suffer from two main problems: a) the collected concepts from LLMs could be redundant with task-irrelevant descriptions, resulting in an inferior concept space with potential mismatch. b) VLMs directly map the global deterministic image embeddings with fine-grained concepts results in an ambiguous process with imprecise mapping results. To address the above two issues, we propose a novel solution for CBMs with Concise Concept and Probabilistic Modeling (CCPM) that can achieve superior classification performance via high-quality concepts and precise mapping strategy. First, we leverage in-context examples as category-related clues to guide LLM concept generation process. To mitigate redundancy in the concept space, we propose a Relation-Aware Selection (RAS) module to obtain a concise concept set that is discriminative and relevant based on image-concept and inter-concept relationships. Second, for precise mapping, we employ a Probabilistic Distribution Adapter (PDA) that estimates the inherent ambiguity of the image embeddings of pre-trained VLMs to capture the complex relationships with concepts. Extensive experiments indicate that our model achieves state-of-the-art results with a 6.18% improvement in classification accuracy on eight mainstream recognition benchmarks as well as reliable explainability through interpretable analysis.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3117-3131"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adversarial Geometric Attacks for 3D Point Cloud Object Tracking","authors":"Rui Yao;Anqi Zhang;Yong Zhou;Jiaqi Zhao;Bing Liu;Abdulmotaleb El Saddik","doi":"10.1109/TMM.2025.3557613","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557613","url":null,"abstract":"3D point cloud object tracking (3D PCOT) plays a vital role in applications such as autonomous driving and robotics. Adversarial attacks offer a promising approach to enhance the robustness and security of tracking models. However, existing adversarial attack methods for 3D PCOT seldom leverage the geometric structure of point clouds and often overlook the transferability of attack strategies. To address these limitations, this paper proposes an adversarial geometric attack method tailored for 3D PCOT, which includes a point perturbation attack module (non-isometric transformation) and a rotation attack module (isometric transformation). First, we introduce a curvature-aware point perturbation attack module that enhances local transformations by applying normal perturbations to critical points identified through geometric features such as curvature and entropy. Second, we design a Thompson sampling-based rotation attack module that applies subtle global rotations to the point cloud, introducing tracking errors while maintaining imperceptibility. Additionally, we design a fused loss function to iteratively optimize the point cloud within the search region, generating adversarially perturbed samples. The proposed method is evaluated on multiple 3D PCOT models and validated through black-box tracking experiments on benchmarks. For P2B, white-box attacks on KITTI reduce the success rate from 53.3% to 29.6% and precision from 68.4% to 37.1%. On NuScenes, the success rate drops from 39.0% to 27.6%, and precision from 39.9 to 26.8%. Black-box attacks show a transferability, with BAT showing a maximum 47.0% drop in success rate and 47.2% in precision on KITTI, and a maximum 22.5% and 27.0% on NuScenes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3144-3157"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144177412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ICE: Interactive 3D Game Character Facial Editing via Dialogue","authors":"Haoqian Wu;Minda Zhao;Zhipeng Hu;Changjie Fan;Lincheng Li;Weijie Chen;Rui Zhao;Xin Yu","doi":"10.1109/TMM.2025.3557611","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557611","url":null,"abstract":"Most recent popular Role-Playing Games (RPGs) allow players to create in-game characters with hundreds of adjustable parameters, including bone positions and various makeup options. Although text-driven auto-customization systems have been developed to simplify the complex process of adjusting these intricate character parameters, they are limited by their single-round generation and lack the capability for further editing and fine-tuning. In this paper, we propose an Interactive Character Editing framework (ICE) to achieve a multi-round dialogue-based refinement process. In a nutshell, our ICE offers a more user-friendly way to enable players to convey creative ideas iteratively while ensuring that created characters align with the expectations of players. Specifically, we propose an Instruction Parsing Module (IPM) that utilizes large language models (LLMs) to parse multi-round dialogues into clear editing instruction prompts in each round. To reliably and swiftly modify character control parameters at a fine-grained level, we propose a Semantic-guided Low-dimension Parameter Solver (SLPS) that edits character control parameters according to prompts in a zero-shot manner. Our SLPS first localizes the character control parameters related to the fine-grained modification, and then optimizes the corresponding parameters in a low-dimension space to avoid unrealistic results. Extensive experimental results demonstrate the effectiveness of our proposed ICE for in-game character creation and the superior editing performance of ICE.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3210-3223"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Modality Prompts: Few-Shot Multi-Label Recognition With Single-Label Training","authors":"Zixuan Ding;Zihan Zhou;Hui Chen;Tianxiang Hao;Yizhe Xiong;Sicheng Zhao;Qiang Zhang;Jungong Han","doi":"10.1109/TMM.2025.3557700","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557700","url":null,"abstract":"Few-shot multi-label recognition (FS-MLR) presents a significant challenge due to the need to assign multiple labels to images with limited examples. Existing methods often struggle to balance the learning of novel classes and the retention of knowledge from base classes. To address this issue, we propose a novel Cross-Modality Prompts (CMP) approach. Unlike conventional methods that rely on additional semantic information to mitigate the impact of limited samples, our approach leverages multimodal prompts to adaptively tune the feature extraction network. A new FS-MLR benchmark is also proposed, which includes single-label training and multi-label testing, accompanied by benchmark datasets constructed from MS-COCO and NUS-WIDE. Extensive experiments on these datasets demonstrate the superior performance of our CMP approach, highlighting its effectiveness and adaptability. Our results show that CMP outperforms CoOp on the MS-COCO dataset with a maximal improvement of 19.47% and 23.94% in mAP<sub>harmonic</sub> for 5-way 1-shot and 5-way 5-shot settings, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3023-3033"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Segmenting Anything in the Dark via Depth Perception","authors":"Peng Liu;Jinhong Deng;Lixin Duan;Wen Li;Fengmao Lv","doi":"10.1109/TMM.2025.3557612","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557612","url":null,"abstract":"Image segmentation under low-light conditions is essential in real-world applications, such as autonomous driving and video surveillance systems. The recent Segment Anything Model (SAM) exhibits strong segmentation capability in various vision applications. However, its performance could be severely degraded under low-light conditions. On the other hand, multimodal information has been exploited to help models construct more comprehensive understanding of scenes under low-light conditions by providing complementary information (e.g., depth). Therefore, in this work, we present a pioneer attempt that elevates a unimodal vision foundation model (e.g., SAM) to a multimodal one, by efficiently integrating additional depth information under low-light conditions. To achieve that, we propose a novel method called Depth Perception SAM (DPSAM) based on the SAM framework. Specifically, we design a modality encoder to extract the depth information and the Depth Perception Layers (DPLs) for mutual feature refinement between RGB and depth features. The DPLs employ the cross-modal attention mechanism to mutually query effective information from both RGB and depth for the subsequent feature refinement. Thus, DPLs can effectively leverage the complementary information from depth to enrich the RGB representations and obtain comprehensive multimodal visual representations for segmenting anything in the dark. To this end, our DPSAM maximally maintains the instinct expertise of SAM for RGB image segmentation and further leverages on the strength of depth for enhanced segmenting anything capability, especially for cases that are likely to fail with RGB only (e.g., low-light or complex textures). As demonstrated by extensive experiments on four RGBD benchmark datasets, DPSAM clearly improves the performance for the segmenting anything performance in the dark, e.g., +12.90% mIoU and +16.23% mIoU on LLRGBD and DeLiVER, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2975-2986"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data","authors":"Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen","doi":"10.1109/TMM.2025.3557703","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557703","url":null,"abstract":"3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct a multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of RoI from vision-following response of LLMs. This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2935-2948"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual-Linguistic Feature Alignment With Semantic and Kinematic Guidance for Referring Multi-Object Tracking","authors":"Yizhe Li;Sanping Zhou;Zheng Qin;Le Wang","doi":"10.1109/TMM.2025.3557710","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557710","url":null,"abstract":"Referring Multi-Object Tracking (RMOT) aims to dynamically track an arbitrary number of referred targets in a video sequence according to the language expression. Previous methods mainly focus on cross-modal fusion at the feature level with designed structures. However, the insufficient visual-linguistic alignment is prone to causing visual-linguistic mismatches, leading to some targets being tracked but not correctly referred especially when facing the language expression with complex semantics or motion descriptions. To this end, we propose to conduct visual-linguistic alignment with semantic and kinematic guidance to effectively align the visual features with more diverse language expressions. In this paper, we put forward a novel end-to-end RMOT framework SKTrack, which follows the transformer-based architecture with a Language-Guided Decoder (LGD) and a Motion-Aware Aggregator (MAA). In particular, the LGD performs deep semantic interaction layer-by-layer in a single frame to enhance the alignment ability of the model, while the MAA conducts temporal feature fusion and alignment across multiple frames to enable the alignment between visual targets and language expression with motion descriptions. Extensive experiments on the Refer-KITTI and Refer-KITTI-v2 demonstrate that SKTrack achieves state-of-the-art performance and verify the effectiveness of our framework and its components.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3034-3044"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}