{"title":"PMMTalk$:$ Speech-Driven 3D Facial Animation From Complementary Pseudo Multi-Modal Features","authors":"Tianshun Han;Shengnan Gui;Yiqing Huang;Baihui Li;Lijian Liu;Benjia Zhou;Ning Jiang;Quan Lu;Ruicong Zhi;Yanyan Liang;Du Zhang;Jun Wan","doi":"10.1109/TMM.2024.3521701","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521701","url":null,"abstract":"Speech-driven 3D facial animation has improved a lot recently while most related works only utilize acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, using complementary <bold>P</b>seudo <bold>M</b>ulti-<bold>M</b>odal features for improving the accuracy of facial animation. The framework entails three modules: PMMTalk encoder, cross-modal alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder employs the off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. Following this, the cross-modal alignment module aligns the audio-image-text features at temporal and semantic levels. Subsequently, the PMMTalk decoder is employed to predict lip-syncing facial blendshape coefficients. Contrary to prior methods, PMMTalk only requires an additional random reference face image but yields more accurate results. Additionally, it is artist-friendly as it seamlessly integrates into standard animation production workflows by introducing facial blendshape coefficients. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale <bold>3D</b> <bold>C</b>hinese <bold>A</b>udio-<bold>V</b>isual <bold>F</b>acial <bold>A</b>nimation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. Codes and datasets are available at PMMTalk.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2570-2581"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics","authors":"Shuzhao Xie;Yuan Xue;Yifei Zhu;Zhi Wang","doi":"10.1109/TMM.2024.3521768","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521768","url":null,"abstract":"The advent of deep learning has precipitated a surge in public machine learning as a service (MLaaS) for multimedia analysis. However, reliance on a single MLaaS can result in product dependency and a loss of better performance offered by multiple MLaaSes. Consequently, many enterprises opt for an intercloud broker capable of managing jobs across various clouds. Though existing works explore the efficient utilization of inter-cloud computational resources and the enhancement of inter-cloud data transfer throughput, they disregard improving the overall accuracy of multiple MLaaSes. In response, we conduct a measurement study on object detection services, which are designed to identify and locate various objects within an image. We discover that combining predictions from multiple MLaaSes can improve analytical performance. However, more MLaaSes do not necessarily equate to better performance. Therefore, we propose SkyML, a user-side MLaaS federation broker that selects a subset of MLaaSes based on the characteristics of the request to achieve optimal multimedia analytical performance. Initially, we design a combinatorial reinforcement learning approach to select the sound MLaaS combination, thereby maximizing user experience. We also present an ingenious, automated taxonomy unification algorithm to minimize human efforts in merging MLaaS-specific labels into a user-preferred label space. Moreover, we devise an optimized ensemble strategy to aggregate predictions from the selected MLaaSes. Evaluations indicate that our similarity-based taxonomy unification approach can reduce annotation costs by 90%. Moreover, real-world trace-driven evaluations further prove that our MLaaS selection method can achieve similar levels of accuracy with a 67% reduction in inference fees.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2463-2476"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hard-Sample Style Guided Patch Attack With RL-Enhanced Motion Pattern for Video Recognition","authors":"Jian Yang;Jun Li;Yunong Cai;Guoming Wu;Zhiping Shi;Chaodong Tan;Xianglong Liu","doi":"10.1109/TMM.2024.3521832","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521832","url":null,"abstract":"Adversarial attacks have been extensively studied in the image field. In recent years, research has shown that video recognition models are also vulnerable to adversarial examples. However, most studies about adversarial attacks for video models have focused on perturbation-based methods, while patch-based black-box attacks have received less attention. Despite the excellent performance of perturbation-based attacks, these attacks are impractical for real-world implementation. Most existing patch-based black-box attacks require occluding larger areas and performing more queries to the target model. In this paper, we propose a hard-sample style guided patch attack with reinforcement learning (RL) enhanced motion patterns for video recognition (HSPA). Specifically, we utilize the style features of video hard samples and transfer their multi-dimensional style features to images to obtain a texture patch set. Then we use reinforcement learning to locate the patch coordinates and obtain a specific adversarial motion pattern of the patch to successfully perform an effective attack on a video recognition model in both the spatial and temporal dimensions. Our experiments on three widely-used video action recognition models (C3D, LRCN, and TDN) and two mainstream datasets (UCF-101 and HMDB-51) demonstrate the superior performance of our method compared to other state-of-the-art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1205-1215"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLAB: Enhancing Video Language Pretraining by Feature Adapting and Blending","authors":"Xingjian He;Sihan Chen;Fan Ma;Zhicheng Huang;Xiaojie Jin;Zikang Liu;Dongmei Fu;Yi Yang;Jing Liu;Jiashi Feng","doi":"10.1109/TMM.2024.3521729","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521729","url":null,"abstract":"Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: <bold>V</b>ideo <bold>L</b>anguage pre-training by feature <bold>A</b>dapting and <bold>B</b>lending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 60.9, and 79.0, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2168-2180"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VB-KGN: Variational Bayesian Kernel Generation Networks for Motion Image Deblurring","authors":"Ying Fu;Xinyu Zhu;Xiaojie Li;Xin Wang;Xi Wu;Shu Hu;Yi Wu;Siwei Lyu;Wei Liu","doi":"10.1109/TMM.2024.3521805","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521805","url":null,"abstract":"Motion blur estimation is a critical and fundamental task in scene analysis and image restoration. While most state-of-the-art deep learning-based methods for single-image motion image deblurring focus on constructing deep networks or developing training strategies, the characterization of motion blur has received less attention. In this paper, we innovatively propose a non-parametric Variational Bayesian Kernel Generation Network (VB-KGN) for characterizing motion blur in a single image. To solve this model, we employ the variational inference framework to approximate the expected statistical distribution of motion blur images in a data-driven manner. The qualitative and quantitative evaluations of our experimental results demonstrate that our proposed model can generate highly accurate motion blur kernels, significantly improving motion image deblurring performance and substantially reducing the need for extensive training sample preprocessing for deblurring tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2028-2042"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion","authors":"Ke Liang;Lingyuan Meng;Hao Li;Meng Liu;Siwei Wang;Sihang Zhou;Xinwang Liu;Kunlun He","doi":"10.1109/TMM.2024.3521742","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521742","url":null,"abstract":"Site selection aims to select optimal locations for new stores, which is crucial in business management and urban computing. The early data-driven models heavily relied on feature engineering, which could not effectively model the complex relationships and diverse influences among different data. To alleviate such issues, the knowledge-driven paradigm is proposed based on urban knowledge graphs (KGs). However, the research on them is at an early stage. They omit extra multi-modal information corresponding to brands and stores due to two main challenges, i.e., (1) building available datasets, and (2) designing effective models. It constrains the expressive ability and practical value of previous models. To this end, we first construct new multi-modal urban KGs for site selection with three extra modal (i.e., visual, textual, and acoustic) attributes. Then, we propose a novel multi-modal knowledge-driven model (MGKsite). Concretely, a graph neural network (GNN) based fusion network is designed to fuse the features based on the attribute K-Nearest Neighbor (KNN) graph, which models both intra and inter-modal correlations among the features. The fused embeddings are further injected into the knowledge-driven backbones for learning and inference. Experiments prove promising capacities of MGKsite from five aspects, i.e., superiority, effectiveness, sensitivity, transferability and complexity.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1722-1735"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Vision Anomaly Detection With the Guidance of Language Modality","authors":"Dong Chen;Kaihang Pan;Guangyu Dai;Guoming Wang;Yueting Zhuang;Siliang Tang;Mingliang Xu","doi":"10.1109/TMM.2024.3521813","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521813","url":null,"abstract":"Recent years have seen a surge of interest in anomaly detection. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and sparse latent space. In contrast, anomaly detectors demonstrate superior performance in the language modality due to the unimodal nature of the data. This paper tackles the aforementioned challenges for vision modality from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), comprising of Cross-modal Entropy Reduction (CMER) and Cross-modal Linear Embedding (CMLE), to address the issues of redundant information and sparse latent space, respectively. CMER involves masking portions of the raw image and computing the matching score with the corresponding text. Essentially, CMER eliminates irrelevant pixels to direct the detector's focus towards critical content. To learn a more compact latent space for the vision anomaly detection, CMLE learns a correlation structure matrix from the language modality. Then, the acquired matrix compels the distribution of images to resemble that of texts in the latent space. Extensive experiments demonstrate the effectiveness of the proposed methods. Particularly, compared to the baseline that only utilizes images, the performance of CMG has been improved by 16.81%. Ablation experiments further confirm the synergy among the proposed CMER and CMLE, as each component depends on the other to achieve optimal performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1410-1419"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection","authors":"Xu Han;Junyu Gao;Chuang Yang;Yuan Yuan;Qi Wang","doi":"10.1109/TMM.2024.3521797","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521797","url":null,"abstract":"Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"287-299"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Category-Level Multi-Object 9D State Tracking Using Object-Centric Multi-Scale Transformer in Point Cloud Stream","authors":"Jingtao Sun;Yaonan Wang;Mingtao Feng;Xiaofeng Guo;Huimin Lu;Xieyuanli Chen","doi":"10.1109/TMM.2024.3521664","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521664","url":null,"abstract":"Category-level object pose estimation and tracking has achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate the object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-Dimensional (9D) state tracking from the point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features to estimate the 9D state of each object in the current observation. We then integrate our network estimation into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experiment results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model of our method to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1072-1085"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated Hallucination Translation and Source-Free Regularization Adaptation in Decentralized Domain Adaptation for Foggy Scene Understanding","authors":"Xiating Jin;Jiajun Bu;Zhi Yu;Hui Zhang;Yaonan Wang","doi":"10.1109/TMM.2024.3521711","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521711","url":null,"abstract":"Semantic foggy scene understanding (SFSU) emerges a challenging task under out-of-domain distribution (OD) due to uncertain cognition caused by degraded visibility. With the strong assumption of data centralization, unsupervised domain adaptation (UDA) reduces vulnerability under OD scenario. Whereas, enlarged domain gap and growing privacy concern heavily challenge conventional UDA. Motivated by gap decomposition and data decentralization, we establish a decentralized domain adaptation (DDA) framework called <bold><u>T</u></b>ranslate th<bold><u>E</u></b>n <bold><u>A</u></b>dapt (abbr. <bold><u>TEA</u></b>) for privacy preservation. Our highlights lie in. (1) Regarding federated hallucination translation, a <bold><u>Dis</u></b>entanglement and <bold><u>Co</u></b>ntrastive-learning based <bold><u>G</u></b>enerative <bold><u>A</u></b>dversarial <bold><u>N</u></b>etwork (abbr. <bold><u>DisCoGAN</u></b>) is proposed to impose contrastive prior and disentangle latent space in cycle-consistent translation. To yield domain hallucination, client minimizes cross-entropy of local classifier but maximizes entropy of global model to train translator. (2) Regarding source-free regularization adaptation, a <bold><u>Pro</u></b>totypical-knowledge based <bold><u>R</u></b>egularization <bold><u>A</u></b>daptation (abbr. <bold><u>ProRA</u></b>) is presented to align joint distribution in output space. Soft adversarial learning relaxes binary label to rectify inter-domain discrepancy and inner-domain divergence. Structure clustering and entropy minimization drive intra-class features closer and inter-class features apart. Extensive experiments exhibit efficacy of our TEA which achieves 55.26% or 46.25% mIoU in adaptation from GTA5 to Foggy Cityscapes or Foggy Zurich, outperforming other DDA methods for SFSU.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1601-1616"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}