{"title":"MSAN: Multiscale self-attention network for pansharpening","authors":"Hangyuan Lu , Yong Yang , Shuying Huang , Rixian Liu , Huimin Guo","doi":"10.1016/j.patcog.2025.111441","DOIUrl":"10.1016/j.patcog.2025.111441","url":null,"abstract":"<div><div>Effective extraction of spectral–spatial features from multispectral (MS) and panchromatic (PAN) images is critical for high-quality pansharpening. However, existing deep learning methods often overlook local misalignment and struggle to integrate local and long-range features effectively, resulting in spectral and spatial distortions. To address these challenges, this paper proposes a refined detail injection model that adaptively learns injection coefficients using long-range features. Building upon this model, a multiscale self-attention network (MSAN) is proposed, consisting of a feature extraction branch and a self-attention mechanism branch. In the former branch, a two-stage multiscale convolution network is designed to fully extract detail features with multiple receptive fields. In the latter branch, a streamlined Swin Transformer (SST) is proposed to efficiently generate multiscale self-attention maps by learning the correlation between local and long-range features. To better preserve spectral–spatial information, a revised Swin Transformer block is proposed by incorporating spectral and spatial attention within the block. The obtained self-attention maps from SST serve as the injection coefficients to refine the extracted details, which are then injected into the upsampled MS image to produce the final fused image. Experimental validation demonstrates the superiority of MSAN over traditional and state-of-the-art methods, with competitive efficiency. The code of this work will be released on GitHub once the paper is accepted.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111441"},"PeriodicalIF":7.5,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143422188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive transformer with Pyramid Fusion for cloth-changing Person Re-Identification","authors":"Guoqing Zhang , Jieqiong Zhou , Yuhui Zheng , Gaven Martin , Ruili Wang","doi":"10.1016/j.patcog.2025.111443","DOIUrl":"10.1016/j.patcog.2025.111443","url":null,"abstract":"<div><div>Recently, Transformer-based methods have made great progress in person re-identification (Re-ID), especially in handling identity changes in clothing-changing scenarios. Most current studies usually use biometric information-assisted methods such as human pose estimation to enhance the local perception ability of clothes-changing Re-ID. However, it is usually difficult for them to establish the connection between local biometric information and global identity semantics during training, resulting in the lack of local perception ability during the inference phase, which limits the improvement of model performance. In this paper, we propose a Transformer-based Adaptive-Aware Attention and Pyramid Fusion Network (<span><math><mrow><msup><mrow><mi>A</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>P</mi><mi>F</mi><mi>N</mi></mrow></math></span>) for CC Re-ID, which can capture and integrate multi-scale visual information to enhance recognition ability. Firstly, to improve the information utilization efficiency of the model in cloth-changing scenarios, we propose a Multi-Layer Dynamic Concentration module (MLDC) to evaluate the importance features at each layer in real time and reduce the computational overlap between related layers. Secondly, we propose a Local Pyramid Aggregation Module (LPAM) to extract multi-scale features, aiming to maintain global perceptual capability and focus on key local information. In this module, we also combine the Fast Fourier Transform (FFT) with self-attention mechanism to more effectively identify and analyze pedestrian gait and other structural details in the frequency domain and reduce the computational complexity of processing high-dimensional data in the self-attention mechanism. Finally, we build a new dataset incorporating diverse atmospheric conditions (for instance wind and rain) to more realistically simulate natural scenarios for the changing of clothes. Extensive experiments on multiple cloth-changing datasets clearly confirm the superior performance of <span><math><mrow><msup><mrow><mi>A</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>P</mi><mi>F</mi><mi>N</mi></mrow></math></span>. The dataset and related code are available on the website: <span><span>https://github.com/jieqiongz1999/vcclothes-w-r</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111443"},"PeriodicalIF":7.5,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143480564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-supervised video object segmentation via pseudo label rectification","authors":"Pinxue Guo , Wei Zhang , Xiaoqiang Li , Jianping Fan , Wenqiang Zhang","doi":"10.1016/j.patcog.2025.111428","DOIUrl":"10.1016/j.patcog.2025.111428","url":null,"abstract":"<div><div>In this paper we propose a novel self-supervised framework for video object segmentation (VOS) which consists of siamese encoders and bi-decoders. Siamese encoders extract multi-level features and generate pseudo labels for each pixel by cross attention in visual-semantic space. Such siamese encoders are learned via the colorization task without any labeled video data. Bi-decoders take in features from different layers of the encoder and output refined segmentation masks. Such bi-decoders are trained by the pseudo labels, and in turn pseudo labels are rectified via bi-decoders mutual learning. The variation of the bi-decoders’ outputs is minimized such that the gap between pseudo labels and the ground-truth is reduced. Experimental results on the challenging datasets DAVIS-2017 and YouTube-VOS demonstrate the effectiveness of our proposed approach.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111428"},"PeriodicalIF":7.5,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio-visual representation learning via knowledge distillation from speech foundation models","authors":"Jing-Xuan Zhang , Genshun Wan , Jianqing Gao , Zhen-Hua Ling","doi":"10.1016/j.patcog.2025.111432","DOIUrl":"10.1016/j.patcog.2025.111432","url":null,"abstract":"<div><div>Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining, which is also applied during finetuning to further enhance the performance on downstream tasks. Our experiments utilized both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrated that our proposed method achieved superior or at least comparable performance to previous state-of-the-art baselines across automatic speech recognition, visual speech recognition, and audio-visual speech recognition tasks. Additionally, comprehensive ablation studies and the visualization of learned representations were conducted to evaluate the effectiveness of our proposed method.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111432"},"PeriodicalIF":7.5,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143402804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An effective bipartite graph fusion and contrastive label correlation for multi-view multi-label classification","authors":"Dawei Zhao , Hong Li , Yixiang Lu , Dong Sun , Qingwei Gao","doi":"10.1016/j.patcog.2025.111430","DOIUrl":"10.1016/j.patcog.2025.111430","url":null,"abstract":"<div><div>Graph-based multi-view multi-label learning effectively utilizes the graph structure underlying the samples to integrate information from different views. However, most existing graph construction techniques are computationally complex. We propose an anchor-based bipartite graph fusion method to accelerate graph learning and perform label propagation. First, we employ an ensemble learning strategy that assigns weights to different views to capture complementary information. Second, heterogeneous graphs from different views are linearly fused to obtain a consensus graph, and graph comparative learning is utilized to bring inter-class relationships closer and enhance the quality of label correlation. Finally, we incorporate anchor samples into the decision-making process and jointly optimize the model using bipartite graph fusion and soft label classification with nonlinear extensions. Experimental results on multiple real-world benchmark datasets demonstrate the effectiveness and scalability of our approach compared to state-of-the-art methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111430"},"PeriodicalIF":7.5,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143387706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WtNGAN: Unpaired image translation from white light images to narrow-band images","authors":"Qinghua Lin , Zuoyong Li , Kun Zeng , Jie Wen , Yuting Jiang , Jian Chen","doi":"10.1016/j.patcog.2025.111431","DOIUrl":"10.1016/j.patcog.2025.111431","url":null,"abstract":"<div><div>As one of the most dangerous cancers, gastric cancer poses a serious threat to human health. Currently, gastroscopy remains the preferred method for gastric cancer diagnosis. In gastroscopy, white light and narrow-band light image are two necessary modalities providing deep learning-based multimodal-assisted diagnosis possibilities. However, there is no paired dataset of white-light images (WLIs) and narrow-band images (NBIs), which hinders the development of these methods. To address this problem, we propose an unpaired image-to-image translation network for translating WLI to NBI. Specifically, we first design a generative adversarial network based on Vision Mamba. The generator enhances the detailed representation capability by establishing long-range dependencies and generating images similar to authentic images. Then, we propose a structural consistency constraint to preserve the original tissue structure of the generated images. We also utilize contrastive learning (CL) to maximize the information interaction between the source and target domains. We conduct extensive experiments on a private gastroscopy dataset for translation between WLIs and NBIs. To verify the effectiveness of the proposed method, we also perform the translation between T1 and T2 magnetic resonance images (MRIs) on the BraTS 2021 dataset. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111431"},"PeriodicalIF":7.5,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143422186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Max360IQ: Blind omnidirectional image quality assessment with multi-axis attention","authors":"Jiebin Yan, Ziwen Tan, Yuming Fang, Jiale Rao, Yifan Zuo","doi":"10.1016/j.patcog.2025.111429","DOIUrl":"10.1016/j.patcog.2025.111429","url":null,"abstract":"<div><div>Omnidirectional image, also called 360-degree image, is able to capture the entire 360-degree scene, thereby providing more realistic immersive feelings for users than general 2D image and stereoscopic image. Meanwhile, this feature brings great challenges to measuring the perceptual quality of omnidirectional images, which is closely related to users’ quality of experience, especially when the omnidirectional images suffer from non-uniform distortion. In this paper, we propose a novel and effective blind omnidirectional image quality assessment (BOIQA) model with multi-axis attention (Max360IQ), which can proficiently measure not only the quality of uniformly distorted omnidirectional images but also the quality of non-uniformly distorted omnidirectional images. Specifically, the proposed Max360IQ is mainly composed of a backbone with stacked multi-axis attention modules for capturing both global and local spatial interactions of extracted viewports, a multi-scale feature integration (MSFI) module to fuse multi-scale features and a quality regression module with deep semantic guidance for predicting the quality of omnidirectional images. Experimental results demonstrate that the proposed Max360IQ outperforms the state-of-the-art Assessor360 by 3.6% in terms of SRCC on the JUFE database with non-uniform distortion, and gains improvement of 0.4% and 0.8% in terms of SRCC on the OIQA and CVIQ databases, respectively. The source code is available at <span><span>https://github.com/WenJuing/Max360IQ</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111429"},"PeriodicalIF":7.5,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143422187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep learning approach for effective classification of fingerprint patterns and human behavior analysis","authors":"Atul Bhimrao Mokal, Brijendra Gupta","doi":"10.1016/j.patcog.2025.111439","DOIUrl":"10.1016/j.patcog.2025.111439","url":null,"abstract":"<div><div>The classification of fingerprints is an imperative guarantee for effective and precise fingerprint detection particularly when discovering fingerprints, but because of higher intra-class variations, smaller inter-class alterations, and clatters the prior techniques need to improve their performance. The aim is to provide a safe and convenient identification and authentication system. It offers more precise consultation for psychologists to identify the behavior of humans and catalogue them. The goal of the study is to identify the behavioral traits of the human based on fingerprint patterns. An automated deep model for the classification of fingerprints for analyzing human behaviours is provided in this paper. Gaussian filter is engaged for abandoning noise from an image. Thereafter the important features like texture-based features and minutiae features are extracted. For determining fingerprint patterns, a Deep Convolutional Neural Network (DCNN) is utilized. The Gannet Bald Optimization (GBO) is employed for training the DCNN to generate the classified patterns that include left loop, plain arch, right loop, tented arch, and whorl. Moreover, each classified pattern is matched with the dictionaries for human behaviour recognition. The proposed GBO-based DCNN obtained high performance and provided better competence when compared with the traditional models.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111439"},"PeriodicalIF":7.5,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143508873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge enhanced prompt learning framework for financial news recommendation","authors":"ShaoBo Sun , Xiaoming Pan , Shuang Qi , Jun Gao","doi":"10.1016/j.patcog.2025.111461","DOIUrl":"10.1016/j.patcog.2025.111461","url":null,"abstract":"<div><div>The aim of financial news recommendation systems is to deliver personalized and timely financial information. Traditional methods face challenges, including the complexity of financial news, which requires stock-related external knowledge and accounts for users' interests in various stocks, industries, and concepts. Additionally, the financial domain's timeliness necessitates adaptable recommender systems, especially in few-shot and cold-start scenarios. To address these challenges, we propose a knowledge-enhanced prompt learning framework for financial news recommendation (FNRKPL). FNRKPL incorporates a financial news knowledge graph and transforms triple information into prompt language to strengthen the recommendation model's knowledge base. Personalized prompt templates are designed to account for users' topic preferences and sentiment tendencies, integrating knowledge, topic, and sentiment prompts. Furthermore, a knowledge-enhanced prompt learning mechanism enhances the model's generalization and adaptability in few-shot and cold-start scenarios. Extensive experiments on real-world corporate datasets validate FNRKPL's effectiveness in both data-rich and resource-poor conditions.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111461"},"PeriodicalIF":7.5,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143444503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel deep neural network for identification of sex and ethnicity based on unknown skulls","authors":"Haibo Zhang, Qianhong Li, Xizhi Wang, Qianyi Wu, Chaohui Ma, Mingquan Zhou, Guohua Geng","doi":"10.1016/j.patcog.2025.111450","DOIUrl":"10.1016/j.patcog.2025.111450","url":null,"abstract":"<div><div>The determination of the sex and ethnicity is crucial in the identification of unknown human remains in biology and forensic science. The components of these two biological traits can be effectively evaluated using the skull, which makes it one of the most essential structures for the aforementioned purpose. However, performing simultaneous determination of sex and ethnicity remains a challenge in the identification of unknown humans. In this study, a multi-attribute recognition framework for unknown skulls, which integrates multitask and multiview cross-attention, is proposed. Multi-angle images of the skull first serve as input to a parallel convolutional neural network, yielding its independent view features. To increase the performance of the skull multi-attribute recognition, a view cross-attention mechanism is then introduced. This mechanism uses the independent view features of the skull to obtain global view features. Afterwards, the final output structure is divided into two branches, one used to identify the gender of the skull and the other to identify its ethnicity. The experiment involves 214 samples that consist of 79 samples (41 males and 38 females) from the Han Chinese population in northern China and 135 samples (60 males and 75 females) from the Uyghur population in Xinjiang, China. The results of the experiment demonstrate that the optimal performance of the skull multi-attribute recognition model is obtained when ResNet18 is used as a feature-sharing network. The gender and ethnic identifications for the skull have accuracies of 95.94 % and 98.45 %, respectively. This verifies that the proposed method has high accuracy and generalization ability.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111450"},"PeriodicalIF":7.5,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143428173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}