{"title":"An Information Compensation Framework for Zero-Shot Skeleton-Based Action Recognition","authors":"Haojun Xu;Yan Gao;Jie Li;Xinbo Gao","doi":"10.1109/TMM.2025.3543004","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543004","url":null,"abstract":"Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training. Previous research has focused on aligning sequences' visual and semantic spatial distributions. However, these methods extract semantic features simply. They ignore that proper prompt design for rich and fine-grained action cues can provide robust representation space clustering. In order to alleviate the problem of insufficient information available for skeleton sequences, we design an information compensation learning framework from an information-theoretic perspective to improve zero-shot action recognition accuracy with a multi-granularity semantic interaction mechanism. Inspired by ensemble learning, we propose a multi-level alignment (MLA) approach to compensate information for action classes. MLA aligns multi-granularity embeddings with visual embedding through a multi-head scoring mechanism to distinguish semantically similar action names and visually similar actions. Furthermore, we introduce a new loss function sampling method to obtain a tight and robust representation. Finally, these multi-granularity semantic embeddings are synthesized to form a proper decision surface for classification. Significant action recognition performance is achieved when evaluated on the challenging NTU RGB+D, NTU RGB+D 120, and PKU-MMD benchmarks and validate that multi-granularity semantic features facilitate the differentiation of action clusters with similar visual features.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4882-4894"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global Spatial-Temporal Information-Based Residual ConvLSTM for Video Space-Time Super-Resolution","authors":"Congrui Fu;Hui Yuan;Shiqi Jiang;Guanghui Zhang;Liquan Shen;Raouf Hamzaoui","doi":"10.1109/TMM.2025.3542970","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542970","url":null,"abstract":"By converting low-frame-rate, low-resolution videos into high-frame-rate, high-resolution ones, space-time video super-resolution techniques can enhance visual experiences and facilitate more efficient information dissemination. We propose a convolutional neural network (CNN) for space-time video super-resolution, namely GIRNet. Our method combines long-term global information and short-term local information from the video to better extract complete and accurate spatial-temporal information. To generate highly accurate features and thus improve performance, the proposed network integrates a feature-level temporal interpolation module with deformable convolutions and a global spatial-temporal information-based residual convolutional long short-term memory (convLSTM) module. In the feature-level temporal interpolation module, we leverage deformable convolution, which adapts to deformations and scale variations of objects across different scene locations. This provides a more efficient solution than conventional convolution for extracting features from moving objects. Our network effectively uses forward and backward feature information to determine inter-frame offsets, leading to the direct generation of interpolated frame features. In the global spatial-temporal information-based residual convLSTM module, the first convLSTM is used to derive global spatial-temporal information from the input features, and the second convLSTM uses the previously computed global spatial-temporal information feature as its initial cell state. This second convLSTM adopts residual connections to preserve spatial information, thereby enhancing the output features. Experiments on the Vimeo90 K dataset show that the proposed method outperforms open source state-of-the-art techniques in peak signal-to-noise-ratio (by 1.45 dB, 1.14 dB, and 0.2 dB over STARnet, TMNet, and 3DAttGAN, respectively), structural similarity index(by 0.027, 0.023, and 0.006 over STARnet, TMNet, and 3DAttGAN, respectively), and visual quality.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5212-5224"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language Knowledge-Assisted Representation Learning for Skeleton-Based Action Recognition","authors":"Haojun Xu;Yan Gao;Zheng Hui;Jie Li;Xinbo Gao","doi":"10.1109/TMM.2025.3543034","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543034","url":null,"abstract":"How humans understand and recognize the actions of others is a complex neuroscientific problem that involves a combination of cognitive mechanisms and neural networks. Research has shown that humans have brain areas that recognize actions that process top-down attentional information, such as the temporoparietal association area. Also, humans have brain regions dedicated to understanding the minds of others and analyzing their intentions, such as the medial prefrontal cortex of the temporal lobe. Skeleton-based action recognition creates mappings for the complex connections between the human skeleton movement patterns and behaviors. Although existing studies encoded meaningful node relationships and synthesized action representations for classification with good results, few of them considered incorporating a priori knowledge to aid potential representation learning for better performance. LA-GCN proposes a graph convolution network using large-scale language models (LLM) knowledge assistance. First, the LLM knowledge is mapped into a priori global relationship (GPR) topology and a priori category relationship (CPR) topology between nodes. The GPR guides the generation of new “bone” representations, aiming to emphasize essential node information from the data level. The CPR mapping simulates category prior knowledge in human brain regions, encoded by the PC-AC module and used to add additional supervision—forcing the model to learn class-distinguishable features. In addition, to improve information transfer efficiency in topology modeling, we propose multi-hop attention graph convolution. It aggregates each node's k-order neighbor simultaneously to speed up model convergence. LA-GCN reaches state-of-the-art on NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5784-5799"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Threefold Encoder Interaction: Hierarchical Multi-Grained Semantic Alignment for Cross-Modal Food Retrieval","authors":"Qi Wang;Dong Wang;Weidong Min;Di Gai;Qing Han;Cheng Zha;Yuling Zhong","doi":"10.1109/TMM.2025.3543067","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543067","url":null,"abstract":"Current cross-modal food retrieval approaches focus mainly on the global visual appearance of food without explicitly considering multi-grained information. Additionally, direct calculation of the global similarity of image-recipe pairs is not particularly effective in terms of latent alignment, which suffers from mismatch during the mutual image-recipe retrieval process. This paper proposes a threefold encoder interaction (TEI) cross-modal food retrieval framework to maintain the multi-granularity of food images and the multi-levels of textual recipes to address the aforementioned challenges. The TEI framework comprises an image encoder, a recipe encoder, and a multi-grained interaction encoder. We simultaneously propose a multi-grained relation-aware attention (MRA) embedded in the multi-grained interaction encoder to capture multi-grained food visual features. The multi-grained interaction similarity scores are calculated to better establish the multi-grained correlation between recipe and image entities based on the extracted hierarchical textual and multi-grained visual features. Finally, a hierarchical multi-grained semantic alignment loss is designed to supervise the whole process of cross-modal training using the multi-grained interaction similarity scores. Extensive qualitative and quantitative experiments on the Recipe1M dataset have demonstrated that the proposed TEI framework achieves multi-grained semantic alignment between image and text modalities and is superior to other state-of-the-art methods in cross-modal food retrieval tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2848-2862"},"PeriodicalIF":8.4,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAN Prior-Enhanced Novel View Synthesis From Monocular Degraded Images","authors":"Kehua Guo;Zheng Wu;Xianhong Wen;Shaojun Guo;Zhipeng Xi;Tianyu Chen","doi":"10.1109/TMM.2025.3542963","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542963","url":null,"abstract":"With the escalating demand for three-dimensional visual applications such as gaming, virtual reality, and autonomous driving, novel view synthesis has become a critical area of research. Current methods mainly depend on multiple views of the same subject to achieve satisfactory results, but there is often a significant lack of available data. Typically, only a single degraded image is available for reconstruction, which may be affected by occlusion, low resolution, or absence of color information. To overcome this limitation, we propose a two-stage feature matching approach designed specifically for single degraded images, leading to the synthesis of high-quality novel perspective images. This method involves the sequential use of an encoder for feature extraction followed by the fine-tuning of a generator for feature matching. Additionally, the integration of an information filtering module proposed by us during the GAN inversion process helps eliminate misleading information present in degraded images, thereby correcting the inversion direction. Extensive experimental results show that our method outperforms existing state-of-the-art single-view novel view synthesis techniques in handling challenges like occluded, grayscale, and low-resolution images. Moreover, the efficacy of our method remains unparalleled even when aforementioned method integrated with image restoration algorithms.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5352-5362"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mask-Aware Light Field De-Occlusion With Gated Feature Aggregation and Texture-Semantic Attention","authors":"Jieyu Chen;Ping An;Xinpeng Huang;Yilei Chen;Chao Yang;Liquan Shen","doi":"10.1109/TMM.2025.3543048","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543048","url":null,"abstract":"A light field image records rich information of a scene from multiple views, thereby providing complementary information for occlusion removal. However, current occlusion removal methods have several issues: 1) inefficient exploitation of spatial and angular complementary information among views; 2) indistinguishable treatment of pixels from foreground occlusion and background; and 3) insufficient exploration of spatial detail supplementation. Therefore, in this article, we propose a mask-aware de-occlusion network (MANet). Specifically, MANet is a joint training network that integrates the occlusion mask predictor (OMP) and the occlusion remover (OR). First, OMP is proposed to provide the location of occluded regions for OR, as the occlusion removal task is ill-posed without occluded region localization. In OR, we introduce gated spatial-angular feature aggregation, which uses a soft gating mechanism to focus on spatial-angular interaction features in non-occluded regions, extracting effective aggregated features specific to the de-occlusion. Then, we design a complementary strategy to fully utilize spatial-angular information among views. Finally, we propose texture-semantic attention to improve the performance of detail generation. Experimental results demonstrate the superiority of MANet, with substantial improvements in both PSNR and SSIM metrics. Moreover, MANet stands out with an efficient parameter count of 2.4 M, making it a promising solution for real-world applications in public safety and security surveillance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5296-5311"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DLS-HCAN: Duplex Label Smoothing Based Hierarchical Context-Aware Network for Fine-Grained 3D Shape Classification","authors":"Shaojin Bai;Liang Zheng;Jing Bai;Xiangyu Ma","doi":"10.1109/TMM.2025.3543077","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543077","url":null,"abstract":"Fine-grained 3D shape classification (FGSC) has garnered significant attention recently and has made notable advancements. However, due to high inter-class similarity and intra-class diversity, it is still a challenge for existing methods to capture subtle differences between different subcategories for FGSC. On the one hand, one-hot labels in loss function are too hard to describe the above data characteristics, and on the other hand, local details are submerged in the global features extraction process and final network constraints, impacting classification results. In this paper, we propose a duplex label smoothing-based hierarchical context-aware network for fine-grained 3D shape classification, named DLS-HCAN. Specifically, DLS-HCAN firstly employs a hierarchical context-aware network (HCAN), in which the intra-view context attention mechanism (intra-ATT) and the inter-view context multilayer perceptron (inter-MLP) are designed to focus on and discern the beneficial local details. Subsequently, we propose a novel duplex label smoothing (DLS) regularization in which shape-level and view-level smooth labels are separately applied in two improved loss functions, adapting to the fine-grained data characteristics and considering the varying uniqueness of different views. Notably, our approach does not require additional annotation information. Experimental results and comparison with state-of-the-art methods demonstrate the superiority of our proposed DLS-HCAN for FGSC. In addition, our approach also achieves comparable performance for the coarse-grained dataset on ModelNet40.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5815-5830"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Viewport Prediction With Unsupervised Multiscale Causal Representation Learning for Virtual Reality Video Streaming","authors":"Yingjie Liu;Dan Wang;Bin Song","doi":"10.1109/TMM.2025.3543087","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543087","url":null,"abstract":"The rise of the metaverse has driven the rapid development of various applications, such as Virtual Reality (VR) and Augmented Reality (AR). As a form of multimedia in the metaverse, VR video streaming (a.k.a., VR spherical video streaming and 360<inline-formula><tex-math>$^{circ }$</tex-math></inline-formula> video streaming) can provide users with a 360<inline-formula><tex-math>$^{circ }$</tex-math></inline-formula> immersive experience. Generally, transmitting VR video requires far more bandwidth than regular videos, which greatly strains existing network transmission. Predicting and selectively streaming VR video in the users' viewports in advance can reduce bandwidth consumption and system latency. However, existing methods either consider only historical viewport-based prediction methods or predict viewports by correlations between visual features of video frames, making it hard to adapt to the dynamics of users and video content. In the meantime, spurious correlations between visual features lead to inaccurate and unreliable prediction results. Hence, we propose an unsupervised multiscale causal representation learning (UMCRL)-based method to predict viewports in VR video streaming, including user preference-based and video content-based viewport prediction models. The former is designed by a position predictor to predict the future users' viewports based on their historical viewports in multiple video frames to adapt to users' dynamic preferences. The latter achieves unsupervised multiscale causal representation learning through an asymmetric causal regressor, used to infer the causalities between local and global-local visual features in video frames, thereby helping the model understand the contextual information in the videos. We embed the causalities in the transformer decoder via causal self-attention for predicting the users' viewports, adapting to the dynamic changes of video content. Finally, combining the results of the two aforementioned models yields the final prediction of the users' viewports. In addition, the QoE of users is satisfied by assigning different bitrates to the tiles in the viewport through a pyramid-based bitrate allocation. The experimental results verify the effectiveness of the method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4752-4764"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Multi-Scale Language Reinforcement for Multimodal Named Entity Recognition","authors":"Enping Li;Tianrui Li;Huaishao Luo;Jielei Chu;Lixin Duan;Fengmao Lv","doi":"10.1109/TMM.2025.3543105","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543105","url":null,"abstract":"Over the recent years, multimodal named entity recognition has gained increasing attentions due to its wide applications in social media. The key factor of multimodal named entity recognition is to effectively fuse information of different modalities. Existing works mainly focus on reinforcing textual representations by fusing image features via the cross-modal attention mechanism. However, these works are limited in reinforcing the text modality at the token level. As a named entity usually contains several tokens, modeling token-level inter-modal interactions is suboptimal for the multimodal named entity recognition problem. In this work, we propose a multimodal named entity recognition approach dubbed Adaptive Multi-scale Language Reinforcement (AMLR) to implement entity-level language reinforcement. To this end, our model first expands token-level textual representations into multi-scale textual representations which are composed of language units of different lengths. After that, the visual information reinforces the language modality by modeling the cross-modal attention between images and expanded multi-scale textual representations. Unlike existing token-level language reinforcement methods, the word sequences of named entities can be directly interacted with the visual features as a whole, making the modeled cross-modal correlations more reasonable. Although the underlying entity is not given, the training procedure can encourage the relevant image contents to adaptively attend to the appropriate language units, making our approach not rely on the pipeline design. Comprehensive evaluation results on two public Twitter datasets clearly demonstrate the superiority of our proposed model.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5312-5323"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SyNet: A Synergistic Network for 3D Object Detection Through Geometric-Semantic-Based Multi-Interaction Fusion","authors":"Xiaoqin Zhang;Kenan Bi;Sixian Chan;Shijian Lu;Xiaolong Zhou","doi":"10.1109/TMM.2025.3542993","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542993","url":null,"abstract":"Driven by rising demands in autonomous driving, robotics, <italic>etc.</i>, 3D object detection has recently achieved great advancement by fusing optical images and LiDAR point data. On the other hand, most existing optical-LiDAR fusion methods straightly overlay RGB images and point clouds without adequately exploiting the synergy between them, leading to suboptimal fusion and 3D detection performance. Additionally, they often suffer from limited localization accuracy without proper balancing of global and local object information. To address this issue, we design a synergistic network (SyNet) that fuses geometric information, semantic information, as well as global and local information of objects for robust and accurate 3D detection. The SyNet captures synergies between optical images and LiDAR point clouds from three perspectives. The first is geometric, which derives high-quality depth by projecting point clouds onto multi-view images, enriching optical RGB images with 3D spatial information for a more accurate interpretation of image semantics. The second is semantic, which voxelizes point clouds and establishes correspondences between the derived voxels and image pixels, enriching 3D point clouds with semantic information for more accurate 3D detection. The third is balancing local and global object information, which introduces deformable self-attention and cross-attention to process the two types of complementary information in parallel for more accurate object localization. Extensive experiments show that SyNet achieves 70.7% mAP and 73.5% NDS on the nuScenes test set, demonstrating its effectiveness and superiority as compared with the state-of-the-art.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4950-4960"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}