{"title":"Delving Into Quaternion Wavelet Transformer for Facial Expression Recognition in the Wild","authors":"Yu Zhou;Jialun Pei;Weixin Si;Jing Qin;Pheng-Ann Heng","doi":"10.1109/TMM.2025.3535361","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535361","url":null,"abstract":"The Facial Expression Recognition (FER) technique has increasingly matured over time. However, recognizing facial expressions in wild environments poses great challenges in achieving promising performance. The main obstacles arise from various factors, such as illumination changes, head pose variations, and occlusions. To overcome interferences from external environments and improve recognition accuracy, we propose a novel Quaternion Wavelet TRansformer (<italic>QWTR</i>) model for FER in the wild. Specifically, we present a Quaternion Value Transformer (<italic>QVT</i>) network that combines quaternion multi-head attention with quaternion CNN to capture emotional cues from global and local perception. To preserve the color structure while enhancing image contrast and brightness, we introduce a Quaternion Histogram Equalization (<italic>QHE</i>) representation to transform color images into quaternion matrices representation. After that, to alleviate the impact of head pose and occlusion together with feature redundancy, a Quaternion Wavelet Feature Selection (<italic>QWFS</i>) scheme is designed to decompose quaternion features and select the most correlated signals. Extensive experiments have been conducted on four in-the-wild FER datasets and several specific FER benchmarks under various conditions. The qualitative and quantitative results demonstrate that <italic>QWTR</i> outperforms other state-of-the-art methods in FER benchmarks, e.g., 68.37% vs. 66.31% accuracy on the AffectNet dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3895-3909"},"PeriodicalIF":8.4,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RSUIA: Dynamic No-Reference Underwater Image Assessment via Reinforcement Sequences","authors":"Jingchun Zhou;Chunjiang Liu;Dehuan Zhang;Zongxin He;Ferdous Sohel;Qiuping Jiang","doi":"10.1109/TMM.2025.3535308","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535308","url":null,"abstract":"Underwater image quality assessment (UIQA) is a challenging task due to the complexities of underwater environments. Traditional UIQA methods primarily rely on fitting mean opinion scores (MOS), which are limited by human visual biases. To address the above limitation, we propose a no-reference underwater image quality assessment paradigm using reinforcement sequences. Our paradigm leverages reinforcement learning to iteratively merge the input image with the corresponding ground truth, generating an optimized sequence of images. A classifier generates probability arrays for the optimized sequence, which are converted into objective scores by a regression model. Unlike existing methods that focus solely on the final quality score, our paradigm emphasizes dynamic quality changes throughout the image-enhancement process. By employing objective mixing ratio labels, our reinforcement sequence dataset reduces subjective bias. The multiscale classifier captures local and global information differences between the input and ground truth images, effectively preserving the contrast and detail in diverse lighting conditions. Our paradigm combines multi-source data classification with support vector regression, optimizing the mapping of feature vectors to quality scores through fine-tuning libsvm kernel parameters. Experimental results on multiple benchmark datasets demonstrate that our paradigm outperforms the state-of-the-art UIQA methods, providing an effective solution for Underwater Image quality Assessment via Reinforcement Sequences (RSUIA).","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3542-3555"},"PeriodicalIF":8.4,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ETC: Temporal Boundary Expand Then Clarify for Weakly Supervised Video Grounding With Multimodal Large Language Model","authors":"Guozhang Li;Xinpeng Ding;De Cheng;Jie Li;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521758","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521758","url":null,"abstract":"Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose <bold>ETC</b> (<bold>E</b>xpand <bold>t</b>hen <bold>C</b>larify), first using the additional information to expand the initial incomplete pseudo-temporal boundaries, and subsequently refining these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise in expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1772-1782"},"PeriodicalIF":8.4,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language-Assisted 3D Scene Understanding","authors":"Yanmin Wu;Qiankun Gao;Renrui Zhang;Haijie Li;Jian Zhang","doi":"10.1109/TMM.2025.3535305","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535305","url":null,"abstract":"The scale and quality of point cloud datasets constrain the advancement of point cloud learning. Recently, with the development of multi-modal learning, the incorporation of domain-agnostic prior knowledge from other modalities, such as images and text, to assist in point cloud feature learning has been considered a promising avenue. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the requirement for paired triplet data, redundancy and ambiguity in supervised features, and the disruption of the original priors. In this paper, we propose a <bold>l</b>anguage-<bold>as</b>sis<bold>t</b>ed approach to <bold>p</b>oint <bold>c</b>loud feature <bold>l</b>earning (<bold>LAST-PCL</b>), enriching semantic concepts through large language model-based text enrichment. We achieve de-redundancy and feature dimensionality reduction without compromising textual priors by statistical-based and training-free significant feature selection. Furthermore, we also delve into an in-depth analysis of the impact of text contrastive training on the point cloud. Extensive experiments validate that the proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance in 3D semantic segmentation, 3D object detection, and 3D scene classification tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3869-3879"},"PeriodicalIF":8.4,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QRNet: Quaternion-Based Refinement Network for Surface Normal Estimation","authors":"Hanlin Bai;Xin Gao;Wei Deng;Jianwang Gan;Yijin Xiong;Kangkang Kou;Guoying Zhang","doi":"10.1109/TMM.2025.3535299","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535299","url":null,"abstract":"In recent years, there has been a notable increase in interest in image-based surface normal estimation. These approaches are capable of predicting the surface normal of real scenes using only an image, thereby facilitating a more profound comprehension of the actual scene and providing assistance with other perceptual tasks. However, dense regression predictions are susceptible to misdirection when encountering intricate details, which presents a paradoxical challenge for image-based surface normal estimation in reconciling detail and density. By introducing quaternion rotations as fusion module with geometric property, we propose a quaternion-based refined network structure that fuses detailed and structural information. Specifically, we design a high-resolution surface normal baseline with a streamlined structure, to extract fine-grained features while reducing the angular error in surface normal regression values caused by downsampling. Additionally, we propose a subtle angle loss function that prevents subtle changes from being overlooked without extra information, further enhancing the model's ability to learn detailed information. The proposed method demonstrates state-of-the-art performance compared to existing techniques on three real-world datasets comprising indoor and outdoor scenes. The results demonstrate the robust effectiveness of our deep learning approach that incorporates geometric prior guidance, highlighting improved robustness in applying deep learning methods. The source code will be released upon acceptance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3356-3369"},"PeriodicalIF":8.4,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding","authors":"Minghong Xie;Mengzhao Wang;Huafeng Li;Yafei Zhang;Dapeng Tao;Zhengtao Yu","doi":"10.1109/TMM.2025.3535345","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535345","url":null,"abstract":"Visual grounding has attracted wide attention thanks to its broad application in various visual language tasks. Although visual grounding has made significant research progress, existing methods ignore the promotion effect of the association between text and image features at different hierarchies on cross-modal matching. This paper proposes a Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction Visual Grounding method. It first generates a mask through decoupled sentence phrases, and a text and image hierarchical matching mechanism is constructed, highlighting the role of association between different hierarchies in cross-modal matching. In addition, a corresponding target object position progressive correction strategy is defined based on the hierarchical matching mechanism to achieve accurate positioning for the target object described in the text. This method can continuously optimize and adjust the bounding box position of the target object as the certainty of the text description of the target object improves. This design explores the association between features at different hierarchies and highlights the role of features related to the target object and its position in target positioning. The proposed method is validated on different datasets through experiments, and its superiority is verified by the performance comparison with the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3979-3991"},"PeriodicalIF":8.4,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-Shot Image Harmonization With Generative Model Prior","authors":"Jianqi Chen;Yilan Zhang;Zhengxia Zou;Keyan Chen;Zhenwei Shi","doi":"10.1109/TMM.2025.3535343","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535343","url":null,"abstract":"We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images in existing methods. These methods, while showing promising results, involve significant training expenses and often struggle with generalization to unseen images. To this end, we introduce a fully modularized framework inspired by human behavior. Leveraging the reasoning capabilities of recent foundation models in language and vision, our approach comprises three main stages. Initially, we employ a pretrained vision-language model (VLM) to generate descriptions for the composite image. Subsequently, these descriptions guide the foreground harmonization direction of a text-to-image generative model (T2I). We refine text embeddings for enhanced representation of imaging conditions and employ self-attention and edge maps for structure preservation. Following each harmonization iteration, an evaluator determines whether to conclude or modify the harmonization direction. The resulting framework, mirroring human behavior, achieves harmonious results without the need for extensive training. We present compelling visual results across diverse scenes and objects, along with quantitative comparisons validating the effectiveness of our approach.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4494-4507"},"PeriodicalIF":9.7,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Pyramid-Structured Long-Range Dependencies for 3D Human Pose Estimation","authors":"Mingjie Wei;Xuemei Xie;Yutong Zhong;Guangming Shi","doi":"10.1109/TMM.2025.3535349","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535349","url":null,"abstract":"Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6 M and MPI-INF-3DHP datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4684-4697"},"PeriodicalIF":9.7,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Repetitive Action Counting With Hybrid Temporal Relation Modeling","authors":"Kun Li;Xinge Peng;Dan Guo;Xun Yang;Meng Wang","doi":"10.1109/TMM.2025.3535385","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535385","url":null,"abstract":"Repetitive Action Counting (RAC) aims to count the number of repetitive actions occurring in videos. In the real world, repetitive actions have great diversity and bring numerous challenges (e.g., viewpoint changes, non-uniform periods, and action interruptions). Existing methods based on the temporal self-similarity matrix (TSSM) for RAC are trapped in the bottleneck of insufficient capturing action periods when applied to complicated daily videos. To tackle this issue, we propose a novel method named Hybrid Temporal Relation Modeling Network (HTRM-Net) to build diverse TSSM for RAC. The HTRM-Net mainly consists of three key components: bi-modal temporal self-similarity matrix modeling, random matrix dropping, and local temporal context modeling. Specifically, we construct temporal self-similarity matrices by bi-modal (self-attention and dual-softmax) operations, yielding diverse matrix representations from the combination of row-wise and column-wise correlations. To further enhance matrix representations, we propose incorporating a random matrix dropping module to guide channel-wise learning of the matrix explicitly. After that, we inject the local temporal context of video frames and the learned matrix into temporal correlation modeling, which can make the model robust enough to cope with error-prone situations, such as action interruption. Finally, a multi-scale matrix fusion module is designed to aggregate temporal correlations adaptively in multi-scale matrices. Extensive experiments across intra- and cross-datasets demonstrate that the proposed method not only outperforms current state-of-the-art methods and but also exhibits robust capabilities in accurately counting repetitive actions in unseen action categories. Notably, our method surpasses the classical TransRAC method by 20.04% in MAE and 22.76% in OBO.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3844-3855"},"PeriodicalIF":8.4,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Centra-Net: A Centralized Network for Visual Localization Spanning Multiple Scenes","authors":"Zhiqiang Jiang;Kun Dai;Ke Wang;Tao Xie;Zhendong Fan;Ruifeng Li;Peng Kang;Lijun Zhao","doi":"10.1109/TMM.2025.3535388","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535388","url":null,"abstract":"We present Centra-Net, a centralized network that concurrently optimizes visual localization over numerous scenes under heterogeneous dataset domains. Centra-Net exemplifies storage efficiency by amalgamating multiple models with task-shared parameters into a singular cohesive structure. Technically, we develop a <bold>basic feature extraction unit (BFEU)</b> with two parallel branches: one dedicated to local feature extraction and the other adept at adaptively generating a task-specific attention mask for feature calibration, thus bolstering its feature extraction capability across diverse scenes. Based on the BFEU, we introduce a <bold>filter-wise sharing mechanism (FSM)</b> that adaptively determines parameter sharing within the unit, thus facilitating fine-grained parameter allocation. The key insight of FSM resides in reconceptualizing the parameter sharing of the unit as a learnable paradigm, enabling the determination of shared parameters to be made post-training. Finally, we suggest a <bold>complexity-prioritized gradient algorithm (CPGA)</b> that capitalizes on task complexity to attain a harmonious learning space for various tasks, thus safeguarding optimal performances across all tasks. Through rigorous experiments on numerous benchmarks, Centra-Net demonstrates a notable edge over existing state-of-the-art works while operating with a significantly reduced parameter footprint.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4698-4712"},"PeriodicalIF":9.7,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}