{"title":"SVSRD: Spatial Visual and Statistical Relation Distillation for Class-Incremental Semantic Segmentation","authors":"Yuyang Chang;Yifan Jiao;Bing-Kun Bao","doi":"10.1109/TMM.2025.3543102","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543102","url":null,"abstract":"Class-incremental semantic segmentation (CISS) aims to incrementally learn novel classes while retaining the ability to segment old classes, and suffers catastrophic forgetting since the old-class labels are unavailable. Most existing methods typically impose strict constraints on the consistency between the extracted features or output logits of each pixel from old and current models in an attempt to prevent forgetting through knowledge distillation (KD), which 1) results in a significant transfer of redundant knowledge while limiting the restoration of old classes (rigidity) due to potentially overlooking essential knowledge extraction, and 2) imposes strong constraints at the pixel level making it challenging for the model to learn novel classes (plasticity). To solve the above limitations, we propose a novel Spatial Visual and Statistical Relation Distillation (SVSRD) by applying multi-scale visual and statistical position relation distillation for CISS, which enjoys several merits. First, we introduce a region-based similarity matrix and impose a consistency constraint between current and old models, which preserves the essential visual knowledge to enhance the rigidity. Second, we propose a novel statistical feature calculation algorithm to investigate the distribution of the data and further preserve the rules of statistics through statistical consistency, which also promotes the model on the novel-class learning for improving the plasticity. Finally, the aforementioned constraints are jointly applied in multiple scales to alleviate old-class forgetting and enhance novel-class learning. Extensive experiments on Pascal-VOC 2012 and ADE20 K demonstrate that the proposed approach performs favorably against the state-of-the-art CISS methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4869-4881"},"PeriodicalIF":9.7,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144750883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facial Expression Recognition With Heatmap Neighbor Contrastive Learning","authors":"Tong Liu;Jing Li;Jia Wu;Bo Du;Yibing Zhan;Dapeng Tao;Jun Wan","doi":"10.1109/TMM.2025.3543029","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543029","url":null,"abstract":"Many supervised learning-based facial expression recognition (FER) methods achieve good performance with the assistance of expression labels and a complex framework. However, there are inconsistent annotations in different expression datasets, making the above methods disadvantageous for new expression datasets or datasets with limited training data. The objective of this paper is to learn self-supervised facial expression features that enable the FER model not to rely on the annotation consistency of the different datasets. Most current self-supervised learning algorithms based on contrastive learning learn the representation by forcing different augmented views of the same image close in the embedding space, but they cannot cover all variances within a semantic class. We propose a heatmap neighbor contrastive learning (HNCL) method for FER. It treats the images corresponding to the heatmap nearest neighbors of expressions as other positives, providing more semantic variations than pre-defined augmented transformations. Therefore, our HNCL can learn better expression features covering more intra-class variances, improving the performance of the FER model based on self-supervised learning. After fine-tuning, HNCL with a simple framework achieves top-three performance on the in-the-lab datasets and even matches the performance of state-of-the-art supervised learning methods on the in-the-wild datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4795-4807"},"PeriodicalIF":9.7,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144750999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Event-Based Video Reconstruction With Bidirectional Temporal Information","authors":"Pinghai Gao;Longguang Wang;Sheng Ao;Ye Zhang;Yulan Guo","doi":"10.1109/TMM.2025.3543010","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543010","url":null,"abstract":"Event-based video reconstruction has emerged as an appealing research direction to break through the limitations of traditional cameras to better record dynamic scenes. Most existing methods reconstruct each frame from its corresponding event subset in chronological order. Since the temporal information contained in the whole event sequence is not fully exploited, these methods suffer inferior reconstruction quality. In this paper, we propose to enhance event-based video reconstruction by leveraging the bidirectional temporal information in event sequences. The proposed model processes event sequences in a bidirectional fashion, allowing for exploiting bidirectional information in the whole sequence. Furthermore, a transformer-based temporal information fusion module is introduced to aggregate long-range information in both temporal and spatial dimensions. Additionally, we propose a new dataset for the event-based video reconstruction task which contains a variety of objects and movement patterns. Extensive experiments demonstrate that the proposed model outperforms existing state-of-the-art event-based video reconstruction methods both quantitatively and qualitatively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4831-4843"},"PeriodicalIF":9.7,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Debiased Mapping for Full-Reference Image Quality Assessment","authors":"Baoliang Chen;Hanwei Zhu;Lingyu Zhu;Shanshe Wang;Jingshan Pan;Shiqi Wang","doi":"10.1109/TMM.2025.3535280","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535280","url":null,"abstract":"An ideal full-reference image quality (FR-IQA) model should exhibit both high separability for images with different quality and compactness for images with the same or indistinguishable quality. However, existing learning-based FR-IQA models that directly compare images in deep-feature space, usually overly emphasize the quality separability, neglecting to maintain the compactness when images are of similar quality. In our work, we identify that the perception bias mainly stems from an inappropriate subspace where images are projected and compared. For this issue, we propose a Debiased Mapping based quality Measure (DMM), leveraging orthonormal bases formed by singular value decomposition (SVD) in the deep features domain. The SVD effectively decomposes the quality variations into singular values and mapping bases, enabling quality inference with more reliable feature difference measures. Extensive experimental results reveal that our proposed measure could mitigate the perception bias effectively and demonstrates excellent quality prediction performance on various IQA datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2638-2649"},"PeriodicalIF":8.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation","authors":"Liyang Chen;Weihong Bao;Shun Lei;Boshi Tang;Zhiyong Wu;Shiyin Kang;Haozhi Huang;Helen Meng","doi":"10.1109/TMM.2025.3535287","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535287","url":null,"abstract":"Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose <bold>AdaMesh</b>, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3598-3609"},"PeriodicalIF":8.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly Supervised LiDAR Semantic Segmentation via Scatter Image Annotation","authors":"Yilong Chen;Zongyi Xu;Xiaoshui Huang;Shanshan Zhao;Xinqi Jiang;Xinyu Gao;Xinbo Gao","doi":"10.1109/TMM.2025.3535350","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535350","url":null,"abstract":"Weakly supervised LiDAR semantic segmentation has made significant strides with limited labeled data. However, most existing methods focus on the network training under weak supervision, while efficient annotation strategies remain largely unexplored. To tackle this gap, we implement LiDAR semantic segmentation using scatter image annotation, effectively integrating an efficient annotation strategy with network training. Specifically, we propose employing scatter images to annotate LiDAR point clouds, combining a pre-trained optical flow estimation network with a foundational image segmentation model to rapidly propagate manual annotations into dense labels for both images and point clouds. Moreover, we propose ScatterNet, a network that includes three pivotal strategies to reduce the performance gap caused by such annotations. First, it utilizes dense semantic labels as supervision for the image branch, alleviating the modality imbalance between point clouds and images. Second, an intermediate fusion branch is proposed to obtain multimodal texture and structural features. Finally, a perception consistency loss is introduced to determine which information needs to be fused and which needs to be discarded during the fusion process. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our method requires less than 0.02% of the labeled points to achieve over 95% of the performance of fully-supervised methods. Notably, our labeled points are only 5% of those used in the most advanced weakly supervised methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4121-4136"},"PeriodicalIF":8.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replay-Based Incremental Object Detection With Local Response Exploration","authors":"Jian Zhong;Yifan Jiao;Bing-Kun Bao","doi":"10.1109/TMM.2025.3535403","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535403","url":null,"abstract":"Incremental object detection (IOD) aims to train an object detector on non-stationary data streams without forgetting previous knowledge. Prevalent replay-based methods keep a buffer composed of carefully selected instances towards this goal. However, due to the limited storage space and uniform feature distribution, existing methods are prone to overfit on replayed instances, leading to poor generalization on diverse test data. Additionally, the imbalance in data quantity makes the detector fail to distinguish old and new classes that are visually similar, introducing bias toward new classes. To enhance the diversity of stored instances and eliminate bias, we propose a Local Response Exploration (LRE) framework, which comprises three modules. First, Region-Entropy Instance Selector (REIS) introduces a novel metric to assess instance diversity based on the entropy of local responses. Second, Confusion-Guided Instance Replay (CGIR) replaces the previous random replay approach by replaying specific old class instances based on class similarity, ensuring that parameters for similar new and old classes are updated together, thereby mitigating bias and helping mining discriminative patterns. Third, Confusion-Aware Region Segregation (CARS) adaptively differentiates biased regions from other regions based on local responses, reducing bias toward new classes while preserving relationships between new and old classes. Extensive evaluations on Pascal-VOC and MS COCO datasets demonstrate that our approach outperforms State-of-the-Art methods in incremental object detection.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4348-4360"},"PeriodicalIF":8.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144581588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPPNet: A Depth Pixel-Wise Potential-Aware Network for RGB-D Salient Object Detection","authors":"Junbin Yuan;Yiqi Wang;Zhoutao Wang;Qingzhen Xu;Bharadwaj Veeravalli;Xulei Yang","doi":"10.1109/TMM.2025.3535386","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535386","url":null,"abstract":"Depth cues are essential for visual perception tasks like Salient Object Detection (SOD). Due to varying depth reliability across scenes, some researchers propose evaluating the overall quality of the depth maps and discarding the less reliable ones to avoid contamination. However, these methods often fail to fully utilize valuable information in depth maps, leading to sub-optimal performance particularly when depth quality is unreliable. Since low-quality depth maps still contain useful information that potentially improves model performance, we propose a Depth Pixel-wise Potential-aware Network to leverage these depth cues effectively. This network includes two novel components designed: 1) A learning strategy for explicitly modeling the confidence of each depth pixel to assist the model in locating valid information in the depth map. 2) A cross-modal adaptive multiple fusion module that fuses features from both RGB and depth modalities. It aims to mitigate the contamination effect of unreliable depth maps and fully exploit the benefits of multiple fusion strategies. Experimental results show that on four publicly available datasets, our method outperforms 17 mainstream methods on various evaluation metrics.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4256-4268"},"PeriodicalIF":8.4,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking the Role of Panchromatic Images in Pan-Sharpening","authors":"Jiaming Wang;Xitong Chen;Xiao Huang;Ruiqian Zhang;Yu Wang;Tao Lu","doi":"10.1109/TMM.2025.3535309","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535309","url":null,"abstract":"Recent pan-sharpening methods have predominantly utilized techniques tailored for natural image scenes, often overlooking the unique features arising from non-overlapping spectral responses. In light of this, we have reevaluated the utility of panchromatic (PAN) images and introduced a theory anchored in the spectral response of satellite sensors. This posits that a PAN image is effectively a linear weighted summation of individual bands from its corresponding multi-spectral (MS) image, offset by an error map. We developed a deep unmixing network termed “DUN” that integrates an unmixing network, a fusion mechanism, and a distinctive mutual information contrastive loss function. Notably, the unmixing network is adept at decomposing a PAN image into its MS counterpart and error map. Further, the demixed image alongside the low-resolution MS image is channeled into the fusion network for pan-sharpening. Recognizing the challenges of achieving robust supervised learning directly from the unmixing phase, we have innovated a mutual information contrastive learning loss function, ensuring enhanced separation and minimizing overlap during the unmixing process. Preliminary experiments underscore both the quantitative and qualitative prowess of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4558-4570"},"PeriodicalIF":9.7,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video Compressed Sensing Via Wavelet Residual Sampling and Dual-Domain Fusion","authors":"Zhu Yin;Zhongcheng Wu;Wuzhen Shi;Guyue Hu;Weisi Lin","doi":"10.1109/TMM.2025.3535326","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535326","url":null,"abstract":"Deep learning-based compressed sensing (CS) technology attracts widespread attention owing to its remarkable reconstruction with only a few sampling measurements and low computational complexity. However, the existing video compressive sampling approaches cannot fully exploit the inherent interframe and intraframe correlations and sparsity of video sequences. To address this limitation, a novel sampling and reconstruction method for video CS (called WRDD) is proposed, which exploits the advantages of wavelet residual sampling and dual-domain fusion optimization. Specifically, in order to capture high-frequency details and achieve efficient and high-quality measurements, we propose a wavelet residual (WR) sampling strategy for the nonkeyframe sampling, which is achieved by the wavelet residuals between nonkeyframes and keyframes. Furthermore, a dual-domain (DD) fusion strategy is proposed, which fully combine intraframe and interframe to improve the reconstruction quality of nonkeyframes both in the pixel domain and multilevel feature domains. Extensive experiments demonstrate that our WRDD surpasses the state-of-the-art video and image CS methods in both subjective and objective evaluations. Besides, it exhibits outstanding antinoise capability and computational efficiency.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4240-4255"},"PeriodicalIF":8.4,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}