{"title":"A Multi-Granularity Relation Graph Aggregation Framework With Multimodal Clues for Social Relation Reasoning","authors":"Cong Xu;Feiyu Chen;Qi Jia;Yihua Wang;Liang Jin;Yunji Li;Yaqian Zhao;Changming Zhao","doi":"10.1109/TMM.2025.3543054","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543054","url":null,"abstract":"The social relation is a fundamental attribute of human beings in daily life. The ability of humans to form large organizations and institutions stems directly from our complex social networks. Therefore, understanding social relationships in the context of multimedia is crucial for building domain-specific or general artificial intelligence systems. The key to reason social relations lies in understanding the human interactions between individuals through multimodal representations such as action and utterance. However, due to video editing techniques and various narrative sequences in videos, two individuals with social relationships may not appear together in the same frame or clip. Additionally, social relations may manifest in different levels of granularity in video expressions. Previous research has not effectively addressed these challenges. Therefore, this paper proposes a <italic>Multi-Granularity Relation Graph Aggregation Framework</i> (<bold>MGRG</b>) to enhance the inference ability for social relation reasoning in multimedia content, like video. Different from existing methods, our method considers the paradigm of jointly inferring the relations by constructing a social relation graph. We design a hierarchical multimodal relation graph illustrating the exchange of information between individuals' roles, capturing the complex interactions at multi-levels of granularity from fine to coarse. In MGRG, we propose two aggregation modules to cluster multimodal features in different granularity layer relation graph, considering temporal aspects and importance. Experimental results show that our method generates a logical and coherent social relation graph and improves the performance in accuracy.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4961-4970"},"PeriodicalIF":9.7,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Cross-Attention Network for Virtual Try-On","authors":"Hao Tang;Bin Ren;Pingping Wu;Nicu Sebe","doi":"10.1109/TMM.2025.3548437","DOIUrl":"https://doi.org/10.1109/TMM.2025.3548437","url":null,"abstract":"In this article, we present an innovative solution tailored for the intricate challenges of the virtual try-on task—our novel Hierarchical Cross-Attention Network, HCANet. HCANet is meticulously crafted with two primary stages: geometric matching and try-on, each playing a crucial role in delivering realistic and visually convincing virtual try-on outcomes. A distinctive feature of HCANet is the incorporation of a novel Hierarchical Cross-Attention (HCA) block into both stages, enabling the effective capture of long-range correlations between individual and clothing modalities. The HCA block functions as a cornerstone, enhancing the depth and robustness of the network. By adopting a hierarchical approach, it facilitates a nuanced representation of the interaction between the person and clothing, capturing intricate details essential for an authentic virtual try-on experience. Our extensive set of experiments establishes the prowess of HCANet. The results showcase its cutting-edge performance across both objective quantitative metrics and subjective evaluations of visual realism. HCANet stands out as a state-of-the-art solution, demonstrating its capability to generate virtual try-on results that not only excel in accuracy but also satisfy subjective criteria of realism. This marks a significant step forward in advancing the field of virtual try-on technologies.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4454-4466"},"PeriodicalIF":9.7,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SSFam: Scribble Supervised Salient Object Detection Family","authors":"Zhengyi Liu;Sheng Deng;Xinrui Wang;Linbo Wang;Xianyong Fang;Bin Tang","doi":"10.1109/TMM.2025.3543092","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543092","url":null,"abstract":"Scribble supervised salient object detection (SSSOD) constructs segmentation ability of attractive objects from surroundings under the supervision of sparse scribble labels. For the better segmentation, depth and thermal infrared modalities serve as the supplement to RGB images in the complex scenes. Existing methods specifically design various feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image input respectively, leading to similar model flood. As the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and prompt interactive capability, we propose an SSSOD family based on SAM, named <italic>SSFam</i>, for the combination input with different modalities. Firstly, different modal-aware modulators are designed to attain modal-specific knowledge which cooperates with modal-agnostic information extracted from the frozen SAM encoder for the better feature ensemble. Secondly, a siamese decoder is tailored to bridge the gap between the training with scribble prompt and the testing with no prompt for the stronger decoding ability. Our model demonstrates the remarkable performance among combinations of different modalities and refreshes the highest level of scribble supervised methods and comes close to the ones of fully supervised methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1988-2000"},"PeriodicalIF":8.4,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotion-Oriented Cross-Modal Prompting and Alignment for Human-Centric Emotional Video Captioning","authors":"Yu Wang;Yuanyuan Liu;Shunping Zhou;Yuxuan Huang;Chang Tang;Wujie Zhou;Zhe Chen","doi":"10.1109/TMM.2025.3535292","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535292","url":null,"abstract":"Human-centric Emotional Video Captioning (H-EVC) aims to generate fine-grained, emotion-related sentences for human-based videos, enhancing the understanding of human emotions and facilitating human-computer emotional interaction. However, existing video captioning methods often overlook subtle emotional clues and interactions in videos. As a result, the generated captions frequently lack emotional information. To address this, we propose <bold>E</b>motion-oriented <bold>C</b>ross-modal <bold>P</b>rompting and <bold>A</b>lignment (ECPA), which improves HEVC accuracy by modeling fine-grained visual-textual emotion clues. Using large foundation models, ECPA introduces two learnable prompting strategies: visual emotion prompting (VEP) and textual emotion prompting (TEP), along with an emotion-oriented cross-modal alignment (ECA) module. VEP uses two levels of visual prompts, <italic>i.e.</i>, emotion recognition (ER) and action unit (AU), to focus on both coarse and fine visual emotional features. TEP devise two-level learnable textual prompts, <italic>i.e.</i>, sentence-level emotional tokens and word-level masked tokens to capture global and local textual emotion representations. ECA introduces another two levels of emotion-oriented prompt alignment learning mechanisms: the ER-sentence level and the AU-word level alignment losses. Both enhance the model's ability to capture and integrate both global and local cross-modal emotion semantics, thereby enabling the generation of fine-grained emotional linguistic descriptions in video captioning. Experiments show ECPA significantly outperforms state-of-the-art methods on various H-EVC datasets (relative improvements of 9.98%, 5.72%, 4.46%, 24.52% on MAFW, and 12.82%, 20.27%, 4.22%, 5.01% on EmVidCap across four evaluation metrics) and supports zero-shot tasks on MSVD and MSRVTT, demonstrating strong applicability and generalization.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3766-3780"},"PeriodicalIF":8.4,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated User Preference Modeling for Privacy-Preserving Cross-Domain Recommendation","authors":"Li Wang;Shoujin Wang;Quangui Zhang;Qiang Wu;Min Xu","doi":"10.1109/TMM.2025.3543106","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543106","url":null,"abstract":"Cross-domain recommendation (CDR) aims to address the data-sparsity problem by transferring knowledge across domains. Existing CDR methods generally assume that the user-item interaction data is shareable between domains, which leads to privacy leakage. Recently, some privacy-preserving CDR (PPCDR) models have been proposed to solve this problem. However, they primarily transfer simple representations learned only from user-item interaction histories, overlooking other useful side information, leading to inaccurate user preferences. Additionally, they transfer differentially private user-item interaction matrices or embeddings across domains to protect privacy. However, these methods offer limited privacy protection, as attackers may exploit external information to infer the original data. To address these challenges, we propose a novel Federated User Preference Modeling (FUPM) framework. In FUPM, first, a novel comprehensive preference exploration module is proposed to learn users' comprehensive preferences from both interaction data and additional data including review texts and potentially positive items. Next, a private preference transfer module is designed to first learn differentially private local and global prototypes, and then privately transfer the global prototypes using a federated learning strategy. These prototypes are generalized representations of user groups, making it difficult for attackers to infer individual information. Extensive experiments on four CDR tasks conducted on the Amazon and Douban datasets validate the superiority of FUPM over SOTA baselines.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5324-5336"},"PeriodicalIF":9.7,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Compressive Sensing With Scale-Variable Adaptive Sampling and Hybrid-Attention Transformer Reconstruction","authors":"Chen Hui;Debin Zhao;Weisi Lin;Shaohui Liu;Feng Jiang","doi":"10.1109/TMM.2025.3535114","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535114","url":null,"abstract":"Recently, a large number of image compressive sensing (CS) methods with deep unfolding networks (DUNs) have been proposed. However, existing methods either use fixed-scale blocks for sampling that leads to limited insights into the image content or employ a plain convolutional neural network (CNN) in each iteration that weakens the perception of broader contextual prior. In this paper, we propose a novel DUN (dubbed SVASNet) for image compressive sensing, which achieves scale-variable adaptive sampling and hybrid-attention Transformer reconstruction with a single model. Specifically, for scale-variable sampling, a sampling matrix-based calculator is first employed to evaluate the reconstruction distortion, which only requires measurements without access to the ground truth image. Then, a Block Scale Aggregation (BSA) strategy is presented to compute the reconstruction distortion under block divisions at different scales and select the optimal division scale for sampling. To realize hybrid-attention reconstruction, a dual Cross Attention (CA) submodule in the gradient descent step and a Spatial Attention (SA) submodule in the proximal mapping step are developed. The CA submodule introduces inter-phase inertial forces in the gradient descent, which improves the memory effect between adjacent iterations. The SA submodule integrates local and global prior representations of CNN and Transformer, and explores local and global affinities between dense feature representations. Extensive experimental results show that the proposed SVASNet achieves significant improvements over the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4333-4347"},"PeriodicalIF":8.4,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clothes-Changing Person Re-Identification With Feasibility-Aware Intermediary Matching","authors":"Jiahe Zhao;Ruibing Hou;Hong Chang;Xinqian Gu;Bingpeng Ma;Shiguang Shan;Xilin Chen","doi":"10.1109/TMM.2025.3535357","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535357","url":null,"abstract":"Current clothes-changing person re-identification (re-id) approaches usually perform retrieval based on clothes-irrelevant features, while neglecting the potential of clothes-relevant features. However, we observe that relying solely on clothes-irrelevant features for clothes-changing re-id is limited, since they often lack adequate identity information and suffer from large intra-class variations. On the contrary, clothes-relevant features can be used to discover same-clothes intermediaries that possess informative identity clues. Based on this observation, we propose a Feasibility-Aware Intermediary Matching (FAIM) framework to additionally utilize <bold>clothes-relevant features</b> for retrieval. First, an Intermediary Matching (IM) module is designed to perform an intermediary-assisted matching process. This process involves using clothes-relevant features to find informative intermediates, and then using clothes-irrelevant features of these intermediates to complete the matching. Second, in order to reduce the negative effect of low-quality intermediaries, an Intermediary-Based Feasibility Weighting (IBFW) module is designed to evaluate the feasibility of intermediary matching process by assessing the quality of intermediaries. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on several widely-used clothes-changing re-id benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3307-3319"},"PeriodicalIF":8.4,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"$mathrm{Tri^{2}plane}$: Advancing Neural Implicit Surface Reconstruction for Indoor Scenes","authors":"Yiping Xie;Haihong Xiao;Wenxiong Kang","doi":"10.1109/TMM.2025.3565989","DOIUrl":"https://doi.org/10.1109/TMM.2025.3565989","url":null,"abstract":"Reconstructing 3D indoor scenes presents significant challenges, requiring models capable of inferring both planar surfaces and intricate details. Although recent methods can generate complete surfaces, they often struggle to simultaneously reconstruct low-texture regions and high-frequency details due to non-local effects. In this paper, we introduce a novel triangle-based triplane representation, named (tri<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>plane), specifically designed to account for the diverse spatial feature distribution and information density of indoor environments. Our method begins by projecting point clouds onto three orthogonal planes, followed by 2D Delaunay triangulation. This representation enables adaptive encoding of low-texture and high-frequency regions by employing triangles of variable sizes. Moreover, we develop a dual tri<inline-formula><tex-math>$^{2}$</tex-math></inline-formula> plane framework that incorporates both geometric and semantic information, significantly enhancing the reconstruction quality. We combine these key modules and evaluate our method on benchmark indoor scene datasets. The results unequivocally demonstrate the superiority of our proposed method over the state-of-the-art Occ-SDF. Specifically, our method achieves significant improvements over Occ-SDF, with margins of 1.3, 1.7, and 2.3 in F-score on the ScanNet, Tanks & Temples, and Replica datasets, respectively. To facilitate further research, we will make our code publicly available.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4910-4923"},"PeriodicalIF":9.7,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding","authors":"Haomiao Xiong;Yunzhi Zhuge;Jiawen Zhu;Lu Zhang;Huchuan Lu","doi":"10.1109/TMM.2025.3557620","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557620","url":null,"abstract":"Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information which results in prolonged training durations and complicates the streamlined framework. To this end, we develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160 K, to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability in navigating the complexities of the physical world. 3UR-LLM directly receives 3D point cloud as input and project 3D features fused with text instructions into a manageable set of tokens. Considering the computation burden derived from these hybrid tokens, we design a 3D compressor module to cohesively compress the 3D spatial cues and textual narrative. 3UR-LLM achieves promising performance with respect to the previous SOTAs, for instance, 3UR-LLM exceeds its counterparts by 7.1% CIDEr on ScanQA, while utilizing fewer training resources. The code and model weights for 3UR-LLM and the 3DS-160 K benchmark are available at 3UR-LLM.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2899-2911"},"PeriodicalIF":8.4,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144170922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning With Noisy Low-Cost MOS for Image Quality Assessment via Dual-Bias Calibration","authors":"Lei Wang;Qingbo Wu;Desen Yuan;King Ngi Ngan;Hongliang Li;Fanman Meng;Linfeng Xu","doi":"10.1109/TMM.2025.3543014","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543014","url":null,"abstract":"Learning-based Image Quality Assessment (IQA) models have obtained impressive performance with the help of reliable subjective quality labels, where Mean Opinion Score (MOS) is the most popular choice. However, in view of the subjective bias of individual annotators, the Labor-Abundant MOS (LA-MOS) typically requires large collections of opinion scores from multiple annotators for each image, which significantly increases the learning cost. In this paper, we aim to learn robust IQA models from Low-Cost MOS (LC-MOS), which only requires very few opinion scores or even a single opinion score for each image. More specifically, we consider the LC-MOS as the noisy observation of LA-MOS and enforce the IQA model learned from LC-MOS to approach the unbiased estimation of LA-MOS. Thus, we represent the subjective bias between LC-MOS and LA-MOS, and the model bias between IQA predictions learned from LC-MOS and LA-MOS (i.e., dual-bias) as two latent variables with unknown parameters. By means of the expectation-maximization-based alternating optimization, we can jointly estimate the parameters of the dual-bias, which suppresses the misleading of LC-MOS via a gated dual-bias calibration (GDBC) module. To the best of our knowledge, this is the first exploration of robust IQA model learning from noisy low-cost labels. Theoretical analysis and extensive experiments on four popular IQA datasets show that the proposed method is robust toward different bias rates and annotation numbers and significantly outperforms the other Learning-based IQA models when only LC-MOS is available. Furthermore, we also achieve comparable performance with respect to the other models learned with LA-MOS.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5617-5631"},"PeriodicalIF":9.7,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}