ST-KeyS: Self-supervised Transformer for Keyword Spotting in historical handwritten documents
Sana Khamekhem Jemni, Sourour Ammar, Mohamed Ali Souibgui, Yousri Kessentini, Abbas Cheddad
Pattern Recognition 170 (2025), Article 112036. DOI: 10.1016/j.patcog.2025.112036. Published online 2 July 2025.

Abstract: Keyword spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections. The most efficient KWS methods today rely on machine learning techniques, which typically require a large amount of annotated training data; for historical manuscripts, however, annotated corpora are scarce. To handle this data scarcity, we investigate the merits of self-supervised learning: useful representations are extracted from the input data without human annotations and then reused in the downstream task. We propose ST-KeyS, a masked autoencoder model based on vision transformers whose pretraining stage follows the mask-and-predict paradigm and requires no labeled data. In the fine-tuning stage, the pretrained encoder is integrated into a Siamese neural network and fine-tuned to improve the feature embeddings of the input images. We further enrich the image representation with a pyramidal histogram of characters (PHOC) embedding, creating an intermediate representation of images based on textual attributes. In an exhaustive experimental evaluation on five widely used benchmark datasets (Botany, Alvermann Konzilsprotokolle, George Washington, Esposalles, and RIMES), the proposed approach outperforms state-of-the-art methods trained on the same datasets.
{"title":"Mixture of coarse and fine-grained prompt tuning for vision-language model","authors":"Yansheng Gao , Zixi Zhu , Shengsheng Wang","doi":"10.1016/j.patcog.2025.112074","DOIUrl":"10.1016/j.patcog.2025.112074","url":null,"abstract":"<div><div>Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose <strong>Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT)</strong>, a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a <strong>Mixed Fusion Module (MFM)</strong> to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a <strong>Dynamic Refinement Adapter (DRA)</strong> is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112074"},"PeriodicalIF":7.5,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144557233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing outdoor vision: Binocular desnowing with dual-stream temporal transformer","authors":"En Yu, Jie Lu, Kaihao Zhang, Guangquan Zhang","doi":"10.1016/j.patcog.2025.112075","DOIUrl":"10.1016/j.patcog.2025.112075","url":null,"abstract":"<div><div>Video desnowing, aimed at removing snowflakes and enhancing the quality of videos, is a crucial yet intricate task essential for improving the effectiveness of outdoor vision systems. Compared to rain and haze, the inherent opacity and diverse morphology of snowflakes result in more pronounced background occlusions, thereby challenging the efficacy of current desnowing techniques, particularly those focusing solely on images or videos captured from a monocular perspective. To address these challenges, this paper proposes a Dual-Stream Temporal Transformer (DSTT) to advance snow removal and visual enhancement by leveraging comprehensive information from stereo views and spatial-temporal cues. More specifically, it incorporates a Dual-Stream Weight-shared Transformer (DSWT) module to exploit spatial information from different views. This module employs a hierarchical weight-sharing strategy to extract fused spatial features across different views from low-level to high-level layers. Subsequently, the Dual-Stream ConvLSTM (DS-CLSTM) module is introduced to capture temporal correlations across streaming frames. By combining temporal-spatial cues and complementary details from diverse views, videos can be effectively restored while preserving the original content’s details. In addition, two binocular snowy datasets – SnowKITTI2012 and SnowKITTI 2015 – are presented, providing a valuable resource for evaluating the binocular desnowing task. Comprehensive experiments evaluated on both synthetic and real-world snowy datasets demonstrate that our proposed method outperforms the state-of-the-art baselines.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112075"},"PeriodicalIF":7.5,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144563005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EC-SLAM: Effectively constrained neural RGB-D SLAM with TSDF hash encoding and joint optimization","authors":"Guanghao Li , Qi Chen , Yuxiang Yan , Jian Pu","doi":"10.1016/j.patcog.2025.112034","DOIUrl":"10.1016/j.patcog.2025.112034","url":null,"abstract":"<div><div>We introduce EC-SLAM, a real-time dense RGB-D Simultaneous Localization and Mapping (SLAM) system leveraging Neural Radiance Fields (NeRF). While recent NeRF-based SLAM systems have shown promising results, they have yet to exploit NeRF’s potential to estimate system state fully. EC-SLAM overcomes this limitation by using a Truncated Signed Distance Fields (TSDF) opacity function with sharp inductive bias to strengthen constraints in sparse parametric encodings, which reduces the number of model parameters and enhances accuracy. Additionally, our system employs a highly constrained global joint optimization approach coupled with a feature-based, uniform sampling algorithm, enabling efficient fusion between TSDF and sparse parametric encodings. This approach reinforces constraints on keyframes most relevant to the current frame, mitigates the influence of random sampling, and effectively utilizes NeRF’s implicit loop closure capability. Extensive evaluations and ablations on the Replica, ScanNet, and TUM datasets demonstrate state-of-the-art performance, achieving precise tracking and reconstruction while maintaining real-time operation at up to 21 FPS. The source code is available at <span><span>https://github.com/Lightingooo/EC-SLAM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112034"},"PeriodicalIF":7.5,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144548748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-Semantic Quality Awareness Network for fine-grained visual categorization
Qin Xu, Sitong Li, Jiahui Wang, Bo Jiang, Bin Luo, Jinhui Tang
Pattern Recognition 170 (2025), Article 112033. DOI: 10.1016/j.patcog.2025.112033. Published online 1 July 2025.

Abstract: Exploring and mining subtle yet distinctive features between sub-categories with similar appearances is crucial for fine-grained visual categorization (FGVC). However, existing FGVC methods cannot mine discriminative features from low-quality samples, leading to a significant decline in performance. To address this issue, we propose a weakly supervised Context-Semantic Quality Awareness Network (CSQA-Net) for FGVC. Specifically, to assess and enhance the quality of multi-granularity visual representations, we propose a Multi-level Semantic Quality Evaluation (MSQE) module built on Quality Probing (QP) classifiers. To alleviate scale confusion and accurately identify locally distinctive regions, a part navigator is developed. Moreover, a Multi-part and Multi-scale Cross-Attention (MMCA) module models the spatial contextual relationship between rich part descriptors and global semantics, capturing more discriminative details within the object. Finally, the context-aware features from MMCA and the semantically enhanced features from MSQE are fed into the corresponding QP classifiers, which evaluate quality in real time and further boost discriminability. Comprehensive experiments on four popular and highly competitive datasets demonstrate the superiority of CSQA-Net over state-of-the-art methods. Code is available at https://github.com/zmisiter/CSQA-Net.
Unbalanced episode meta-learning with Bi-Sparse contrastive network for hyperspectral target detection
Quanyong Liu, Yang Xu, Zebin Wu, Jiangtao Peng, Zhihui Wei
Pattern Recognition 170 (2025), Article 112030. DOI: 10.1016/j.patcog.2025.112030. Published online 30 June 2025.

Abstract: Deep learning (DL) has been extensively applied to hyperspectral image target detection (HTD) with notable success. However, many existing DL-based methods focus on expanding the training samples to capture richer information, resulting in high computational costs and overfitting risks. Challenges such as complex data distributions and limited model transferability remain significant obstacles. To address these issues, we propose an unbalanced episode meta-learning with Bi-sparse contrastive network (UEML) for HTD. Instead of directly modeling the target dataset, our approach leverages meta-learning to pre-train the model on a categorical dataset rich in label information, yielding a universal detection model. Specifically, an unbalanced episode training paradigm is proposed for meta-task construction: it simulates the category-imbalance scenarios inherent to HTD by adaptively adjusting the support set, enabling the acquisition of content-agnostic yet task-relevant transferable meta-knowledge. Additionally, elastic sparsity constraints are imposed on the feature extraction process across both the spatial and spectral dimensions, enhancing the model's generalization and discriminative capabilities. During the fine-tuning phase, we employ a pseudo-sample generation strategy based on segmented sampling and spatial-spectral hybrid augmentation to construct the training set, allowing more accurate and comprehensive sample extraction from complex background regions; this strategy effectively mitigates underfitting caused by insufficient information. Furthermore, contrastive learning is incorporated to address the complexities arising from multi-class background characteristics in the pseudo-binary classification task, improving the stability of the detection model. The proposed algorithm detects targets rapidly, and experiments on six public datasets indicate that it performs significantly better than existing state-of-the-art methods. Code is available at: https://github.com/QYo-Liu/UEML.
Disentangled representation learning with causal effect transmission in variational autoencoder
Dianlong You, Zexuan Li, Jiawei Shen, Zhao Yu, Shunfu Jin, Xindong Wu
Pattern Recognition 170 (2025), Article 112018. DOI: 10.1016/j.patcog.2025.112018. Published online 30 June 2025.

Abstract: Disentangled representation learning in the variational autoencoder (VAE) has emerged as a strategy to identify and disentangle underlying factors from observable data, improving recognition of data such as images, speech, and biological signals. Existing disentanglement methods mostly rest on the prior assumption that latent variables are mutually independent, which is inconsistent with reality and fails to transmit causal effects among causal nodes. To address these issues, we introduce a novel disentangled representation learning model with causal effect transmission, named DRL-CET. The main ideas of DRL-CET are (1) mapping encoded latent exogenous variables to causal variables and updating the causal structure with a constructed nonlinear/linear structural causal model (SCM), (2) designing a hierarchical feature loss from the discriminator to replace the pixel-level loss of the variational autoencoder, efficiently extracting causal features, and (3) aggregating causal information from adjacent nodes with a graph attention network (GAT) under intervention to transmit causal effects. Extensive theoretical analyses and empirical studies on synthetic and real datasets demonstrate the effectiveness, viability, and superiority of DRL-CET over the state of the art. Our code is publicly available at https://github.com/youdianlong/DRLCET.git.
Point Geometrical Coulomb Force: An explicit and robust embedding for point cloud analysis
Haojun Xu, Ling Hu, Qinsong Li, Shengjun Liu, Dong-ming Yan, Xinru Liu
Pattern Recognition 170 (2025), Article 112025. DOI: 10.1016/j.patcog.2025.112025. Published online 30 June 2025.

Abstract: Most existing point cloud frameworks aggregate local point cloud features with max-pooling functions. However, when handling data containing local high-frequency noise such as local drop, addition, and jitter, this mechanism lets high-frequency noise spread from local to global and causes severe performance degradation. To address this issue, we extend two concepts from physics, the electrostatic field and the Coulomb force, into geometric processing. Specifically, we treat the entire point cloud as placed in an electrostatic field, with each point acting as a probe charge, and we equip this field with a set of source charges derived from the structure of the cloud. We endow the two types of charges with different electric quantities, which encode informative geometric structural information. By analogously computing the Coulomb force between each probe charge and its corresponding source charges, we obtain an explicit embedding, the Point Geometric Coulomb Force (PGCF), for each point. Because PGCF makes deep use of the structural information of the point cloud, and because the electrostatic field of each source charge is not affected by variations of the probe charges, PGCF provides richer geometric information while remaining robust to local noise. Using PGCF combined with point coordinates as input significantly improves existing 3D point cloud feature extraction frameworks, including point convolution, graph convolution, and point transformers, without additional parameters or computational overhead, and thus without affecting their inference speed. Experimental results show that integrating PGCF into existing works yields better results across a wide range of 3D point cloud analysis tasks, including classification, part segmentation, and semantic segmentation.
{"title":"AMGSN: Adaptive mask-guide supervised network for debiased facial expression recognition","authors":"Tianlong Gu, Hao Li, Xuan Feng, Yiqin Luo","doi":"10.1016/j.patcog.2025.112023","DOIUrl":"10.1016/j.patcog.2025.112023","url":null,"abstract":"<div><div>Facial expression recognition plays a crucial role in understanding human emotions and behavior. However, existing models often exhibit biases and imbalance towards diverse expression classes. To address this problem, we propose an Adaptive Mask-Guide Supervised Network (AMGSN) to enhance the uniform performance of the facial expression recognition models. We propose an adaptive mask guidance mechanism to mitigate bias and ensure uniform performance across different expression classes. AMGSN focuses on learning the ability to distinguish facial features with under-expressed expressions by dynamically generating masks during pre-training. Specifically, we employ an asymmetric encoder–decoder architecture, where the encoder encodes only the unmasked visible regions, while the lightweight decoder reconstructs the original image using latent representations and mask markers. By utilizing dynamically generated masks and focusing on informative regions, these models effectively reduce the interference of confounding factors, thus enhancing the discriminative power of the learned representation. In the pre-training stage, we introduce the Attention-Based Mask Generator (ABMG) to identify salient regions of expressions. Additionally, we advance the Mask Ratio Update Strategy (MRUS), which utilizes image reconstruction loss, to adjust the mask ratio for each image during pre-training. In the finetune stage, debiased center loss and contrastive loss are introduced to optimize the network to ensure the overall performance of expression recognition. Extensive experimental results on several standard datasets demonstrate that the proposed AMGSN significantly improves both balance and accuracy compared to state-of-the-art methods. For example, AMGSN reached 89.34% on RAF-DB, and 62.83% on AffectNet, respectively, with a standard deviation of only 0.0746 and 0.0484. This demonstrates the effectiveness of our improvements<span><span><sup>1</sup></span></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112023"},"PeriodicalIF":7.5,"publicationDate":"2025-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144548745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MDGP-forest: A novel deep forest for multi-class imbalanced learning based on multi-class disassembly and feature construction enhanced by genetic programming","authors":"Zhikai Lin , Yong Xu , Kunhong Liu , Liyan Chen","doi":"10.1016/j.patcog.2025.112070","DOIUrl":"10.1016/j.patcog.2025.112070","url":null,"abstract":"<div><div>Class imbalance is a significant challenge in the field of machine learning. Due to factors such as quantity differences and feature overlap among classes, the imbalance problem for multiclass classification is more difficult than that for binary one, which leads to the existing research primarily focusing on the binary classification scenario. This study proposes a novel deep forest algorithm with the aid of Genetic Programming (GP), MDGP-Forest, for the multiclass imbalance problem. MDGP-Forest utilizes Multi-class Disassembly and undersampling based on instance hardness between layers to obtain multiple binary classification datasets, each corresponding to a GP population for feature construction. The improved fitness function of GP assesses the incremental importance of the constructed features for enhanced vectors, introducing higher-order information into subsequent layers to improve predicted performance. Each GP population generates a set of new features that improve the separability of classes, empowering MDGP-Forest with the capability to address the challenge of overlapping features among multiple classes. We thoroughly evaluate the classification performance of MDGP-Forest on 35 datasets. The experimental results demonstrate that MDGP-Forest significantly outperforms existing methods in addressing multiclass imbalance problems, exhibiting superior predictive performance.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112070"},"PeriodicalIF":7.5,"publicationDate":"2025-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}