{"title":"CLIP-GAN: Stacking CLIPs and GAN for Efficient and Controllable Text-to-Image Synthesis","authors":"Yingli Hou;Wei Zhang;Zhiliang Zhu;Hai Yu","doi":"10.1109/TMM.2025.3535304","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535304","url":null,"abstract":"Recent advances in text-to-image synthesis have captivated audiences worldwide, drawing considerable attention. Although significant progress in generating photo-realistic images through large pre-trained autoregressive and diffusion models, these models face three critical constraints: (1) The requirement for extensive training data and numerous model parameters; (2) Inefficient, multi-step image generation process; and (3) Difficulties in controlling the output visual features, requiring complexly designed prompts to ensure text-image alignment. Addressing these challenges, we introduce the CLIP-GAN model, which innovatively integrates the pretrained CLIP model into both the generator and discriminator of the GAN. Our architecture includes a CLIP-based generator that employs visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator, utilizing CLIP's advanced scene understanding capabilities for more precise image quality evaluation. Additionally, our generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM) enabling better fusion of text and image semantic information. This integration within the generator and discriminator enhances training efficiency, enabling our model to achieve evaluation results not inferior to large pre-trained autoregressive and diffusion models, but with a 94% reduction in learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation given the limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation speed and the potential for greater stylistic diversity within the GAN model, while still preserving its smooth latent space.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3702-3715"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facial Action Units as a Joint Dataset Training Bridge for Facial Expression Recognition","authors":"Shuyi Mao;Xinpeng Li;Fan Zhang;Xiaojiang Peng;Yang Yang","doi":"10.1109/TMM.2025.3535327","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535327","url":null,"abstract":"Label biases in facial expression recognition (FER) datasets, caused by annotators' subjectivity, pose challenges in improving the performance of target datasets when auxiliary labeled data are used. Moreover, training with multiple datasets can lead to visible degradations in the target dataset. To address these issues, we propose a novel framework called the AU-aware Vision Transformer (AU-ViT), which leverages unified action unit (AU) information and discards expression annotations of auxiliary data. AU-ViT integrates an elaborately designed AU branch in the middle part of a master ViT to enhance representation learning during training. Through qualitative and quantitative analyses, we demonstrate that AU-ViT effectively captures expression regions and is robust to real-world occlusions. Additionally, we observe that AU-ViT also yields performance improvements on the target dataset, even without auxiliary data, by utilizing pseudo AU labels. Our AU-ViT achieves performances superior to, or comparable to, that of the state-of-the-art methods on FERPlus, RAFDB, AffectNet, LSD and the other three occlusion test datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3331-3342"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"S3GAAR: Segmented Spatiotemporal Skeleton Graph-Attention for Action Recognition","authors":"Musrea Abdo Ghaseb;Ahmed Elhayek;Fawaz Alsolami;Abdullah Marish Ali","doi":"10.1109/TMM.2025.3535284","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535284","url":null,"abstract":"Human motion recognition is extremely important for many practical applications in several disciplines, such as surveillance, medicine, sports, gait analysis, and computer graphics. Graph convolutional networks (GCNs) enhance the accuracy and performance of skeleton-based action recognition. However, this approach has difficulties in modeling long-term temporal dependencies. In Addition, the fixed topology of the skeleton graph is not sufficiently robust to extract features for skeleton motions. Although transformers that rely entirely on self-attention have demonstrated great success in modeling global correlations between inputs and outputs, they ignore the local correlations between joints. In this study, we propose a novel segmented spatiotemporal skeleton graph-attention network (S3GAAR) to effectively learn different human actions and concentrate on the most operative part of the human body for each action. The proposed S3GAAR models spatial-temporal features through spatiotemporal attention for each segment to capture short-term temporal dependencies. Owing to several human actions that focus on one or more body parts such as mutual actions, our novel method divides the human skeleton into three segments: superior, inferior, and extremity joints. Our proposed method is designed to extract the features of each segment individually because human actions focus on one or more segments. Moreover, our segmented spatiotemporal graph introduces additional edges between important distant joints in the same segment. The experimental results show that our novel method outperforms state-of-the-art methods up to 1.1% on two large-scale benchmark datasets, NTU-RGB+D 60 and NTU-RGB+D 120.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3437-3446"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geometry-Aware Self-Supervised Indoor 360$^{circ }$ Depth Estimation via Asymmetric Dual-Domain Collaborative Learning","authors":"Xu Wang;Ziyan He;Qiudan Zhang;You Yang;Tiesong Zhao;Jianmin Jiang","doi":"10.1109/TMM.2025.3535340","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535340","url":null,"abstract":"Being able to estimate monocular depth for spherical panoramas is of fundamental importance in 3D scene perception. However, spherical distortion severely limits the effectiveness of vanilla convolutions. To push the envelope of accuracy, recent approaches attempt to utilize Tangent projection (TP) to estimate the depth of <inline-formula><tex-math>$360 ^{circ }$</tex-math></inline-formula> images. Yet, these methods still suffer from discrepancies and inconsistencies among patch-wise tangent images, as well as the lack of accurate ground truth depth maps under a supervised fashion. In this paper, we propose a geometry-aware self-supervised <inline-formula><tex-math>$360 ^{circ }$</tex-math></inline-formula> image depth estimation methodology that explores the complementary advantages of TP and Equirectangular projection (ERP) by an asymmetric dual-domain collaborative learning strategy. Especially, we first develop a lightweight asymmetric dual-domain depth estimation network, which enables to aggregate depth-related features from a single TP domain, and then produce depth distributions of the TP and ERP domains via collaborative learning. This effectively mitigates stitching artifacts and preserves fine details in depth inference without overspending model parameters. In addition, a frequent-spatial feature concentration module is devised to simultaneously capture non-local Fourier features and local spatial features, such that facilitating the efficient exploration of monocular depth cues. Moreover, we introduce a geometric structural alignment module to further improve geometric structural consistency among tangent images. Extensive experiments illustrate that our designed approach outperforms existing self-supervised <inline-formula><tex-math>$360 ^{circ }$</tex-math></inline-formula> depth estimation methods on three publicly available benchmark datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3224-3237"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stream-ViT: Learning Streamlined Convolutions in Vision Transformer","authors":"Yingwei Pan;Yehao Li;Ting Yao;Chong-Wah Ngo;Tao Mei","doi":"10.1109/TMM.2025.3535321","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535321","url":null,"abstract":"Recently Vision Transformer (ViT) and Convolution Neural Network (CNN) start to emerge as a hybrid deep architecture with better model capacity, generalization, and latency trade-off. Most of these hybrid architectures often directly stack self-attention module with static convolution or fuse their outputs through two pathways within each block. Instead, we present a new Transformer architecture (namely Stream-ViT) to novelly integrate ViT with streamlined convolutions, i.e., a series of high-to-low resolution convolutions. The kernels of each convolution are dynamically learnt on a basis of current input features plus pre-learnt kernels throughout the whole network. The new architecture incorporates a critical pathway to streamline kernel generation that triggers the interactions between dynamically learnt convolutions across different layers. Moreover, the introduction of a layer-wise streamlined convolution is functionally equivalent to a squeezed version of multi-branch convolution structure, thereby improving the capacity of self-attention module with enlarged cardinality in a cost-efficient manner. We validate the superiority of Stream-ViT over multiple vision tasks, and its performances surpass state-of-the-art ViT and CNN backbones with comparable FLOPs.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3755-3765"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Pairwise-Semantic Enhancement Hashing for Large-Scale Cross-Modal Retrieval","authors":"Wai Keung Wong;Lunke Fei;Jianyang Qin;Shuping Zhao;Jie Wen;Zhihao He","doi":"10.1109/TMM.2025.3535401","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535401","url":null,"abstract":"Cross-modal hash learning has drawn widespread attention for large-scale multimodal retrieval because of its stability and efficiency in approximate similarity searches. However, most existing cross-modal hashing approaches employ discrete label-guided information to coarsely reflect intra- and intermodality correlations, making them less effective to measuring the semantic similarity of data with multiple modalities. In this paper, we propose a new heterogeneous pairwise-semantic enhancement hashing (HPsEH) for large-scale cross-modal retrieval by distilling higher-level pairwise-semantic similarity from supervision information. First, we adopt a supervised self-expression to learn a data-specific quantified semantic matrix, which uses real values to measure both the similarity and dissimilarity ranks of paired instances, such that the intrinsic semantics of the data can be well captured. Then, we fuse the label-based information and quantified semantic similarity to collaboratively learn the hash codes of multimodal data, such that both the intermodality consistency and modality-specific features can be simultaneously obtained during hash code learning. Moreover, we employ effective iterative optimization to address the discrete binary solution and massive pairwise matrix calculation, making the HPsEH scalable to large-scale datasets. Extensive experimental results on three widely used datasets demonstrate the superiority of our proposed HPsEH method over most state-of-the art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3238-3250"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency","authors":"Xiaoqing Liu;Huanqiang Zeng;Yifan Shi;Jianqing Zhu;Kaixiang Yang;Zhiwen Yu","doi":"10.1109/TMM.2025.3535378","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535378","url":null,"abstract":"In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing has emerged as a focal point of research, taking advantage of the inherent advantages of the multifaceted and dynamism inherent in multimedia data. Existing unsupervised cross-modal hashing methods rely mainly on initial pre-trained correlations among cross-modal features, and the inaccurate neighborhood correlations impacts the presentation of common semantics throughout the optimization. To address the aforementioned issues, we propose <bold>E</b>nsemble <bold>P</b>rototype <bold>Net</b>works (EPNet), which delineates class attributes of cross-modal instances through an ensemble clustering methodology. EPNet seeks to extract correlation information between instances by leveraging local correlation aggregation and ensemble clustering from multiple perspectives, aiming to reduce initialization effects and enhance cross-modal representations. Specifically, the local correlation aggregation is first proposed within a batch of semantic affinity relationships to generate a precise and compact hash code among cross-modal instances. Secondly, the ensemble prototype module is employed to discern the class attributes of deep features, thereby aiding the model in extracting more universally applicable feature representations. Thirdly, an early attempt to constrict the representational congruity of local semantic affinity relationships and deep feature ensemble prototype correlations using cross-task consistency loss aims to enhance the representation of cross-modal common semantic features. Finally, EPNet outperforms several state-of-the-art cross-modal retrieval methods on three real-world image-text datasets in extensive experiments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3476-3488"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Local Fine-Grained Visual Tracking","authors":"Jingjing Wu;Yifan Sun;Richang Hong","doi":"10.1109/TMM.2025.3535329","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535329","url":null,"abstract":"This paper introduces a novel local fine-grained visual tracking task, aiming to precisely locate arbitrary local parts of objects. This task is motivated by our observation that in many realistic scenarios, the user demands to track a local part instead of a holistic object. However, the absence of an evaluation dataset and the distinctive characteristics of local fine-grained targets present extra challenges in conducting this research. To tackle these issues, first, this paper constructs a local fine-grained tracking (LFT) dataset to evaluate the tracking performance for local fine-grained targets. Second, this paper designs a cutting-edge solution to handle the challenges posed by properties of local objects, including ambiguity and high-proportion backgrounds. It consists of a hierarchical adaptive mask mechanism and foreground-background differentiated learning. The former adaptively searches for and masks ambiguity, which drives the network to concentrate on the local target instead of the holistic objects. The latter is constructed to distinguish foreground and background in an unsupervised manner, which is beneficial to mitigate the impacts of high-proportion backgrounds. Extensive analytic experiments are performed to verify the effectiveness of each submodule in the proposed fine-grained tracker.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3426-3436"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning","authors":"Wujian Peng;Zejia Weng;Hengduo Li;Zuxuan Wu;Yu-Gang Jiang","doi":"10.1109/TMM.2025.3535115","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535115","url":null,"abstract":"Exploring a substantial amount of unlabeled data, semi-supervised learning boosts the recognition performance when only a limited number of labels are provided. However, conventional methods assume a class-balanced data distribution, which is difficult to realize in practice due to the long-tailed nature of real-world data. While addressing the data imbalance is a well-explored area in supervised learning paradigms, directly transferring existing approaches to SSL is nontrivial, as prior knowledge about unlabeled data distribution remains unknown in SSL. In light of this, we introduce the Balanced Memory Bank (BMB), a framework for long-tailed semi-supervised learning. The core of BMB is an online-updated memory bank that caches historical features alongside their corresponding pseudo-labels, and the memory is also carefully maintained to ensure the data therein are class-rebalanced. Furthermore, an adaptive weighting module is incorporated to work jointly with the memory bank to further re-calibrate the biased training process. Experimental results across various datasets demonstrate the superior performance of BMB compared with state-of-the-art approaches. For instance, an improvement of 8.2% on the 1% labeled subset of ImageNet127 and 4.3% on the 50% labeled subset of ImageNet-LT.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3677-3687"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subjective and Objective Quality Assessment of Non-Uniformly Distorted Omnidirectional Images","authors":"Jiebin Yan;Jiale Rao;Xuelin Liu;Yuming Fang;Yifan Zuo;Weide Liu","doi":"10.1109/TMM.2025.3535372","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535372","url":null,"abstract":"Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the “same amount” of noise, while ignoring the non-uniform distortion issue, i.e., partial regions undergo “different amount” of perturbation with the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on the platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we elaborately explore this topic from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera len(s). Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2695-2707"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}