{"title":"Auto-adjustable dual-information graph regularized NMF for multiview data clustering","authors":"Shuo Li , Chen Yang , Hui Guo","doi":"10.1016/j.patcog.2025.111679","DOIUrl":"10.1016/j.patcog.2025.111679","url":null,"abstract":"<div><div>Multiview data processing has gained significant attention in machine learning due to its ability to integrate complementary information from diverse data sources. Among various multiview clustering methods, non-negative matrix factorization (NMF)-based approaches have shown strong potential. However, existing methods rely on fixed, single-loss functions and single manifold regularization terms, which limit their adaptability to diverse and heterogeneous datasets. To address these challenges, we propose the multiview auto-adjustable robust dual-information graph regularized non-negative matrix factorization (MARDNMF). This method introduces a novel set of dynamically adjustable loss functions, each incorporating two correntropy terms, which are tuned via adaptive parameters based on the data characteristics. Additionally, MARDNMF leverages multi-scale k-nearest neighbors (KNNs) to build a dual-information graph regularization term, capturing both local and discriminative manifold information. Experimental results across various datasets demonstrate that MARDNMF outperforms existing NMF-based methods in both single view and multiview clustering scenarios, offering enhanced robustness and adaptability.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111679"},"PeriodicalIF":7.5,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Masked auto-encoding and scatter-decoupling transformer for polarimetric SAR image classification","authors":"Jie Geng, Lijia Dong, Yuhang Zhang, Wen Jiang","doi":"10.1016/j.patcog.2025.111660","DOIUrl":"10.1016/j.patcog.2025.111660","url":null,"abstract":"<div><div>The pixel level annotation of polarimetric SAR (PolSAR) image is quite difficult and requires a significant amount of manpower. Deep learning based PolSAR image classification generally faces the challenge of scarce labeled data. To address the above issue, we propose a self-supervised learning model based on masked auto-encoding and scatter-decoupling transformer (MAST) for PolSAR image classification, which aims to fully utilize a large number of unlabeled data. Combined with PolSAR scattering characteristics, an effective pre-training auxiliary task is designed to constrain the model in order to learn spatial information and global scattering representation from SAR images. In the fine-tuning stage, a scattering embedding module is applied to strengthen the representation of global semantic information with specific scattering characteristics. In addition, a supervised contrastive loss is introduced to improve the robustness of the classifier. Sufficient experiments are conducted on three public PolSAR datasets, and the results demonstrate the effectiveness of the proposed method.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111660"},"PeriodicalIF":7.5,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143824358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cascaded Physical-constraint Conditional Variational Auto Encoder with socially-aware diffusion for pedestrian trajectory prediction","authors":"Haojie Chen , Zhuo Wang , Hongde Qin , Xiaokai Mu","doi":"10.1016/j.patcog.2025.111667","DOIUrl":"10.1016/j.patcog.2025.111667","url":null,"abstract":"<div><div>Pedestrian trajectory prediction serves as a crucial prerequisite for various tasks such as autonomous driving and human–robot interaction. The existing methods mainly leverage deep learning-based generative models to predict future multi-modal trajectories. Nevertheless, the inherent uncertainty in pedestrian movements poses a challenge for deep generative models to generate accurate and plausible future trajectories. In this paper, we propose a two-stage trajectory prediction network termed CPSD. In the first stage, a Cascaded Physical-constraint Conditional Variational Auto Encoder is proposed. It combines Differentiable Physical Constraint Conditional Variational Auto Encoders in the cascaded form to predict the trajectory coordinates with a stepwise manner, which improves the interpretability of deep generative network and alleviates the problem of prediction error accumulation over time. In the second stage, a Socially-aware Diffusion Model is proposed to refine the initial trajectory generated in the first stage. By introducing a non-local attention mechanism and constructing a social mask, we integrate pedestrian social interactions into the diffusion model, enabling the refinement of more realistic and plausible multi-modal pedestrian trajectories. Extensive experiments conducted on the public datasets SDD and ETH/UCY demonstrate that CPSD achieves more promising pedestrian trajectories compared with other state-of-the-art trajectory prediction algorithms.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111667"},"PeriodicalIF":7.5,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MLLM as video narrator: Mitigating modality imbalance in video moment retrieval","authors":"Weitong Cai , Jiabo Huang , Shaogang Gong , Hailin Jin , Yang Liu","doi":"10.1016/j.patcog.2025.111670","DOIUrl":"10.1016/j.patcog.2025.111670","url":null,"abstract":"<div><div>Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, <em>i.e.</em>, the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLM), in this work, we propose a novel MLLM-driven framework Text-Enhanced Alignment (TEA), to address the modality imbalance problem by enhancing the correlated visual-textual knowledge. TEA takes an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensibility for localization, we design to get text narratives for each certain video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextual cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111670"},"PeriodicalIF":7.5,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143859305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual-function discriminator for semantic image synthesis in variational GANs","authors":"Aihua Ke , Bo Cai , Yujie Huang , Jian Luo , Yaoxiang Yu , Le Li","doi":"10.1016/j.patcog.2025.111684","DOIUrl":"10.1016/j.patcog.2025.111684","url":null,"abstract":"<div><div>Semantic image synthesis aims to generate target images conditioned on given semantic labels, but existing methods often struggle with maintaining high visual quality and accurate semantic alignment. To address these challenges, we propose VD-GAN, a novel framework that integrates advanced architectural and functional innovations. Our variational generator, built on an enhanced U-Net architecture combining a pre-trained Swin transformer and CNN, captures both global and local semantic features, generating high-quality images. To further boost performance, we design two innovative modules: the Conditional Residual Attention Module (CRAM) for dimensionality reduction modulation and the Channel and Spatial Attention Mechanism (CSAM) for extracting key semantic relationships across channel and spatial dimensions. Additionally, we introduce a dual-function discriminator that not only distinguishes real and synthesized images, but also performs multi-class segmentation on synthesized images, guided by a redefined class-balanced cross-entropy loss to ensure semantic consistency. Extensive experiments show that VD-GAN outperforms the latest supervised methods, with improvements of (FID, mIoU, Acc) by (5.40%, 4.37%, 1.48%) and increases in auxiliary metrics (LPIPS, TOPIQ) by (2.45%, 23.52%). The code will be available at <span><span>https://github.com/ah-ke/VD-GAN.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111684"},"PeriodicalIF":7.5,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143851757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving imbalanced medical image classification through GAN-based data augmentation methods","authors":"Hongwei Ding , Nana Huang , Yaoxin Wu , Xiaohui Cui","doi":"10.1016/j.patcog.2025.111680","DOIUrl":"10.1016/j.patcog.2025.111680","url":null,"abstract":"<div><div>In the medical field, there exists a prevalent issue of data imbalance, severely impacting the performance of machine learning. Traditional data augmentation methods struggle to effectively generate augmented samples with strong diversity. Generative Adversarial Networks (GANs) can produce more effective new samples by learning the global distribution of samples. Although existing GAN models can balance inter-class distributions, the presence of sparse samples within classes can lead to intra-class mode collapse, rendering them unable to effectively fit the sparse region distribution. Based on this, our study proposes a two-step solution. Firstly, we employ a Cluster-Based Local Outlier Factor (CBLOF) algorithm to identify sparse and dense samples intra-class. Then, using these sparse and dense samples as conditions, we train the GAN model to better focus on fitting sparse samples intra-class. Finally, after training the GAN model, we propose using the One-Class SVM (OCS) algorithm as a noise filter to obtain pure augmented samples. We conducted extensive validation experiments on four medical datasets: BloodMNIST, OrganCMNIST, PathMNIST, and PneumoniaMNIST. The experimental results indicate that the method proposed in this study can generate samples with greater diversity and higher quality. Furthermore, by incorporating augmented samples, the accuracy improved by approximately 3% across four datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111680"},"PeriodicalIF":7.5,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143824357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain generalization for image classification with dynamic decision boundary","authors":"Zhiming Cheng , Mingxia Liu , Defu Yang , Zhidong Zhao , Chenggang Yan , Shuai Wang","doi":"10.1016/j.patcog.2025.111678","DOIUrl":"10.1016/j.patcog.2025.111678","url":null,"abstract":"<div><div>Domain Generalization (DG) has been widely used in image classification tasks to effectively handle distribution shifts between source and target domains without accessing target domain data. Traditional DG methods typically rely on static models trained on the source domain for inference on unseen target domains, limiting their ability to fully leverage target domain characteristics. Test-Time Adaptation (TTA)-based DG methods improve generalization performance by adapting the model during inference using target domain samples. However, this often requires parameter fine-tuning on unseen target domains during inference, which may lead to forgetting of source domain knowledge or reduce real-time performance. To address this limitation, we propose a Dynamic Decision Boundary-based DG (DDB-DG) method for image classification, which effectively leverages target domain characteristics during inference without requiring additional training. In the proposed DDB-DG, we first introduce a Prototype-guide Multi-lever Prediction (PMP) module, which guides the dynamic adjustment of the decision boundary learned from the source domain by leveraging the correlation between test samples and prototypes. To enhance the accuracy of prototype computation, we also propose a data augmentation method called Uncertainty Style Mixture (USM), which expands the diversity of training samples to improve model generalization performance and enhance the accuracy of pseudo-labeling for target domain samples in prototypes. We validate DDB-DG using different backbone networks on three publicly available benchmark datasets: PACS, Office-Home, and VLCS. Experimental results demonstrate that our method achieves superior performance on both ResNet-18 and ResNet-50, surpassing the state-of-the-art DG and TTA methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111678"},"PeriodicalIF":7.5,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143824342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring dynamic plane representations for neural scene reconstruction","authors":"Ruihong Yin , Yunlu Chen , Sezer Karaoglu , Theo Gevers","doi":"10.1016/j.patcog.2025.111683","DOIUrl":"10.1016/j.patcog.2025.111683","url":null,"abstract":"<div><div>The efficient tri-plane representations present limited expressivity for encoding complex 3D scenes. To cope with the hampered spatial expressivity of tri-planes, this paper proposes a novel dynamic plane representation method for 3D scene reconstruction, including dynamic long-axis plane learning, a point-to-plane relationship module, and explicit coarse-to-fine feature projection. First, the proposed dynamic long-axis plane learning employs several planes along the principal axis and adapts planar positions dynamically, which can enhance geometry expressivity. Second, a point-to-plane relationship module is proposed to capture distinguished point features by learning the feature bias between plane features and point features. Third, the explicit coarse-to-fine feature projection employs a non-linear transformation to capture fine features from learnable coarse features, exploiting both local and global information with fewer increases in parameters. Experimental results on ScanNet and 7-Scenes demonstrate that our method achieves state-of-the-art performance with comparable computational costs.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111683"},"PeriodicalIF":7.5,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143834061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLDiffusion: Learning degradation representations in diffusion models for low-light image enhancement","authors":"Tao Wang , Kaihao Zhang , Yong Zhang , Wenhan Luo , Björn Stenger , Tong Lu , Tae-Kyun Kim , Wei Liu","doi":"10.1016/j.patcog.2025.111628","DOIUrl":"10.1016/j.patcog.2025.111628","url":null,"abstract":"<div><div>Current deep learning methods for low-light image enhancement typically rely on pixel-wise mappings using paired data, often overlooking the specific degradation factors inherent to low-light conditions, such as noise amplification, reduced contrast, and color distortion. This oversight can result in suboptimal performance. To address this limitation, we propose a degradation-aware learning framework that explicitly integrates degradation representations into the model design. We introduce LLDiffusion, a novel model composed of three key modules: a Degradation Generation Network (DGNET), a Dynamic Degradation-Aware Diffusion Module (DDDM), and a Latent Map Encoder (E). This approach enables joint learning of degradation representations, with the pre-trained Encoder (E) and DDDM effectively incorporating degradation and image priors into the diffusion process for improved enhancement. Extensive experiments on public benchmarks show that LLDiffusion outperforms state-of-the-art low-light image enhancement methods quantitatively and qualitatively. The source code and pre-trained models will be available at <span><span>https://github.com/TaoWangzj/LLDiffusion</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111628"},"PeriodicalIF":7.5,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MDSI: Pluggable Multi-strategy Decoupling with Semantic Integration for RGB-D Gesture Recognition","authors":"Fengyi Fang , Zihan Liao , Zhehan Kan , Guijin Wang , Wenming Yang","doi":"10.1016/j.patcog.2025.111653","DOIUrl":"10.1016/j.patcog.2025.111653","url":null,"abstract":"<div><div>Gestures encompass intricate visual representations, containing both task-relevant cues such as hand shapes and task-irrelevant elements like backgrounds and performer appearances. Despite progress in RGB-D-based gesture recognition, two primary challenges persist: (i) <em>Information Redundancy</em> (IR), which hinders the task-relevant feature extraction in the entangled space and misleads the recognition; (ii) <em>Information Absence</em> (IA), which exacerbates the difficulty of identifying visually similar instances. To alleviate these drawbacks, we propose a pluggable Multi-strategy Decoupling with Semantic Integration methodology, termed MDSI, for RGB-D gesture recognition. For IR, we introduce a Multi-strategy Decoupling Network (MDN) to precisely segregate pose-motion and spatial-temporal-channel features across modalities, thus effectively mitigating redundant information. For IA, we introduce the Semantic Integration Network (SIN), which integrates natural language modeling through semantic filtering and semantic label smoothing, markedly enhancing the model’s semantic understanding and knowledge integration. MDSI’s pluggable architecture allows for seamless integration into various RGB-D-based gesture recognition methods with minimal computational overhead. Experiments conducted on two public datasets demonstrate that our approach provides better feature representation and achieves better performance than state-of-the-art methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111653"},"PeriodicalIF":7.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143855749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}