{"title":"Enhancing robust VQA via contrastive and self-supervised learning","authors":"Runlin Cao , Zhixin Li , Zhenjun Tang , Canlong Zhang , Huifang Ma","doi":"10.1016/j.patcog.2024.111129","DOIUrl":"10.1016/j.patcog.2024.111129","url":null,"abstract":"<div><div>Visual Question Answering (VQA) aims to evaluate the reasoning abilities of an intelligent agent using visual and textual information. However, recent research indicates that many VQA models rely primarily on learning the correlation between questions and answers in the training dataset rather than demonstrating actual reasoning ability. To address this limitation, we propose a novel training approach called Enhancing Robust VQA via Contrastive and Self-supervised Learning (CSL-VQA) to construct a more robust VQA model. Our approach involves generating two types of negative samples to balance the biased data, using self-supervised auxiliary tasks to help the base VQA model overcome language priors, and filtering out biased training samples. In addition, we construct positive samples by removing spurious correlations in biased samples and perform auxiliary training through contrastive learning. Our approach does not require additional annotations and is compatible with different VQA backbones. Experimental results demonstrate that CSL-VQA significantly outperforms current state-of-the-art approaches, achieving an accuracy of 62.30% on the VQA-CP v2 dataset, while maintaining robust performance on the in-distribution VQA v2 dataset. Moreover, our method shows superior generalization capabilities on challenging datasets such as GQA-OOD and VQA-CE, proving its effectiveness in reducing language bias and enhancing the overall robustness of VQA models.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111129"},"PeriodicalIF":7.5,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TransMatch: Transformer-based correspondence pruning via local and global consensus","authors":"Yizhang Liu , Yanping Li , Shengjie Zhao","doi":"10.1016/j.patcog.2024.111120","DOIUrl":"10.1016/j.patcog.2024.111120","url":null,"abstract":"<div><div>Correspondence pruning aims to filter out false correspondences (a.k.a. outliers) from the initial feature correspondence set, which is pivotal to matching-based vision tasks, such as image registration. To solve this problem, most existing learning-based methods typically use a multilayer perceptron framework and several well-designed modules to capture local and global contexts. However, few studies have explored how local and global consensuses interact to form cohesive feature representations. This paper proposes a novel framework called TransMatch, which leverages the full power of Transformer structure to extract richer features and facilitate progressive local and global consensus learning. In addition to enhancing feature learning, Transformer is used as a powerful tool to connect the above two consensuses. Benefiting from Transformer, our TransMatch is surprisingly effective for differentiating correspondences. Experimental results on correspondence pruning and camera pose estimation demonstrate that the proposed TransMatch outperforms other state-of-the-art methods by a large margin. The code will be available at <span><span>https://github.com/lyz8023lyp/TransMatch/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111120"},"PeriodicalIF":7.5,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"L2T-DFM: Learning to Teach with Dynamic Fused Metric","authors":"Zhaoyang Hai, Liyuan Pan, Xiabi Liu, Mengqiao Han","doi":"10.1016/j.patcog.2024.111124","DOIUrl":"10.1016/j.patcog.2024.111124","url":null,"abstract":"<div><div>The loss function plays a crucial role in the construction of machine learning algorithms. Employing a teacher model to set loss functions dynamically for student models has attracted attention. In existing works, (1) the characterization of the dynamic loss suffers from some inherent limitations, <em>ie</em>, the computational cost of loss networks and the restricted similarity measurement handcrafted loss functions; and (2) the states of the student model are provided to the teacher model directly without integration, causing the teacher model to underperform when trained on insufficient amounts of data. To alleviate the above-mentioned issues, in this paper, we select and weigh a set of similarity metrics by a confidence-based selection algorithm and a temporal teacher model to enhance the dynamic loss functions. Subsequently, to integrate the states of the student model, we employ statistics to quantify the information loss of the student model. Extensive experiments demonstrate that our approach can enhance student learning and improve the performance of various deep models on real-world tasks, including classification, object detection, and semantic segmentation scenarios.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111124"},"PeriodicalIF":7.5,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-distillation with beta label smoothing-based cross-subject transfer learning for P300 classification","authors":"Shurui Li , Liming Zhao , Chang Liu , Jing Jin , Cuntai Guan","doi":"10.1016/j.patcog.2024.111114","DOIUrl":"10.1016/j.patcog.2024.111114","url":null,"abstract":"<div><h3>Background:</h3><div>The P300 speller is one of the most well-known brain-computer interface (BCI) systems, offering users a novel way to communicate with their environment by decoding brain activity.</div></div><div><h3>Problem:</h3><div>However, most P300-based BCI systems require a longer calibration phase to develop a subject-specific model, which can be inconvenient and time-consuming. Additionally, it is challenging to implement cross-subject P300 classification due to significant inter-individual variations.</div></div><div><h3>Method:</h3><div>To address these issues, this study proposes a calibration-free approach for P300 signal detection. Specifically, we incorporate self-distillation along with a beta label smoothing method to enhance model generalization and overall system performance, which can not only enable the distillation of informative knowledge from the electroencephalogram (EEG) data of other subjects but effectively reduce individual variability.</div></div><div><h3>Experimental results:</h3><div>The results conducted on the publicly available OpenBMI dataset demonstrate that the proposed method achieves statistically significantly higher performance compared to state-of-the-art approaches. Notably, the average character recognition accuracy of our method reaches up to 97.37% without the need for calibration. And information transfer rate and visualization further confirm its effectiveness.</div></div><div><h3>Significance:</h3><div>This method holds great promise for future developments in BCI applications.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111114"},"PeriodicalIF":7.5,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders","authors":"Zuozhuo Dai , Kaihui Cheng , Fangtao Shao , Zilong Dong , Siyu Zhu","doi":"10.1016/j.patcog.2024.111099","DOIUrl":"10.1016/j.patcog.2024.111099","url":null,"abstract":"<div><div>State-of-the-art methods for text–video retrieval generally leverage CLIP embeddings and cosine similarity for efficient retrieval. Meanwhile, recent advancements in cross-attention techniques introduce transformer decoders to facilitate attention computation between text queries and visual tokens extracted from video frames, enabling a more comprehensive interaction between textual and visual information. In this study, we combine the advantages of both approaches and propose a fine-grained re-ranking approach incorporating a multi-grained text–video cross attention module. Specifically, the re-ranker enhances the top K similar candidates identified by the cosine similarity network. To explore video and text interactions efficiently, we introduce frame and video token selectors to obtain salient visual tokens at both frame and video levels. Then, a multi-grained cross-attention mechanism is applied between text and visual tokens at these levels to capture multimodal information. To reduce the training overhead associated with the multi-grained cross-attention module, we freeze the vision backbone and only train the multi-grained cross attention module. This frozen strategy allows for scalability to larger pre-trained vision models such as ViT-G, leading to enhanced retrieval performance. Experimental evaluations on text–video retrieval datasets showcase the effectiveness and scalability of our proposed re-ranker combined with existing state-of-the-art methodologies.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111099"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating the convergence of concept drift based on knowledge transfer","authors":"Husheng Guo , Zhijie Wu , Qiaoyan Ren , Wenjian Wang","doi":"10.1016/j.patcog.2024.111145","DOIUrl":"10.1016/j.patcog.2024.111145","url":null,"abstract":"<div><div>Concept drift detection and processing is an important issue in streaming data mining. When concept drift occurs, online learning model often cannot quickly adapt to the new data distribution due to the insufficient newly distributed data, which may lead to poor model performance. Currently, most online learning methods adapt to new data distributions after concept drift through autonomous adjustment of the model, but they may often fail to update the model to a stable state quickly. To solve these problems, this paper proposes an accelerating convergence method of concept drift based on knowledge transfer (<span><math><mrow><mi>ACC</mi><mtext>_</mtext><mi>KT</mi></mrow></math></span>). It extracts the most valuable information from the source domain (pre-drift data), and transfers it to the target domain (post-drift data), to realize the update of the ensemble model by knowledge transfer. Besides, different knowledge transfer patterns are adopted to accelerate convergence of model performance when different types concept drift occur. Experimental results show that the proposed method has an obvious acceleration effect on the online learning model after concept drift.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111145"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified framework for unsupervised action learning via global-to-local motion transformer","authors":"Boeun Kim , Jungho Kim , Hyung Jin Chang , Tae-Hyun Oh","doi":"10.1016/j.patcog.2024.111118","DOIUrl":"10.1016/j.patcog.2024.111118","url":null,"abstract":"<div><div>Human action recognition remains challenging due to the inherent complexity arising from the combination of diverse granularity of semantics, ranging from the local motion of body joints to high-level relationships across multiple people. To learn this multi-level characteristic of human action in an unsupervised manner, we propose a novel pretraining strategy along with a transformer-based model architecture named <em>GL-Transformer++</em>. Prior methods in unsupervised action recognition or unsupervised group activity recognition (GAR) have shown limitations, often focusing solely on capturing a partial scope of the action, such as the local movements of each individual or the broader context of the overall motion. To tackle this problem, we introduce a novel pretraining strategy named <em>multi-interval pose displacement prediction (MPDP)</em> that enables the model to learn the diverse extents of the action. In the architectural aspect, we incorporate the <em>global and local attention (GLA)</em> mechanism within the transformer blocks to learn local dynamics between joints, global context of each individual, as well as high-level interpersonal relationships in both spatial and temporal manner. In fact, the proposed method is a unified approach that demonstrates efficacy in both action recognition and GAR. Particularly, our method presents a new and strong baseline, surpassing the current SOTA GAR method by significant margins: 29.6% in Volleyball and 60.3% and 59.9% on the xsub and xset settings of the Mutual NTU dataset, respectively.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111118"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACFNet: An adaptive cross-fusion network for infrared and visible image fusion","authors":"Xiaoxuan Chen , Shuwen Xu , Shaohai Hu , Xiaole Ma","doi":"10.1016/j.patcog.2024.111098","DOIUrl":"10.1016/j.patcog.2024.111098","url":null,"abstract":"<div><div>Considering the prospects for image fusion, it is necessary to guide the fusion to adapt to downstream vision tasks. In this paper, we propose an Adaptive Cross-Fusion Network (ACFNet) that utilizes an adaptive approach to fuse infrared and visible images, addressing cross-modal differences to enhance object detection performance. In ACFNet, a hierarchical cross-fusion module is designed to enrich the features at each level of the reconstructed images. In addition, a special adaptive gating selection module is proposed to realize feature fusion in an adaptive manner so as to obtain fused images without the interference of manual design. Extensive qualitative and quantitative experiments have demonstrated that ACFNet is superior to current state-of-the-art fusion methods and achieves excellent results in preserving target information and texture details. The fusion framework, when combined with the object detection framework, has the potential to significantly improve the precision of object detection in low-light conditions.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111098"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MvWECM: Multi-view Weighted Evidential C-Means clustering","authors":"Kuang Zhou , Yuchen Zhu , Mei Guo , Ming Jiang","doi":"10.1016/j.patcog.2024.111108","DOIUrl":"10.1016/j.patcog.2024.111108","url":null,"abstract":"<div><div>Traditional multi-view clustering algorithms, designed to produce hard or fuzzy partitions, often neglect the inherent ambiguity and uncertainty in the cluster assignment of objects. This oversight may lead to performance degradation. To address these issues, this paper introduces a novel multi-view clustering method, termed MvWECM, capable of generating credal partitions within the framework of belief functions. The objective function of MvWECM is introduced considering the uncertainty in the cluster structure included in the multi-view dataset. We take into account inter-view conflict to effectively leverage coherent information across different views. Moreover, the effectiveness is heightened through the incorporation of adaptive view weights, which are customized to modulate their smoothness in accordance with their entropy. The optimization method to get the optimal credal membership and class prototypes is derived. The view wights can be also provided as a by-product. Experimental results on several real-word datasets demonstrate the effectiveness and superiority of MvWECM by comparing with some state-of-the-art methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111108"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A robust fingerprint identification approach using a fuzzy system and novel rotation method","authors":"Ahmad A. Momani, László T. Kóczy","doi":"10.1016/j.patcog.2024.111134","DOIUrl":"10.1016/j.patcog.2024.111134","url":null,"abstract":"<div><div>Forensic science has developed significantly in the last few decades. Its key role is to provide crime investigators with processed data obtained from the crime scene to achieve more accurate results presented in court. Biometrics has proved its robustness against various critical crimes encountered by forensics experts. Fingerprints are the most important biometric used until now due to their uniqueness and production low cost. The automated fingerprint identification system (AFIS) came into existence in the early 1960s through the cooperation of the countries: USA, UK, France, and Japan. Ever since it started to develop gradually because of the challenges found at the crime scenes such as fingerprints distortions and partial cuts which in turn can severely affect the final calculations made by experts. The vagueness of the results was the main motivation to build a robust fingerprint identification system that introduces new and enhanced methods in its stages to help experts make more accurate decisions. The proposed fingerprint identification system uses Fourier domain analysis for image enhancement, then the system cuts the image around the core point after applying the rotation and core point detection methods. After that, it calculates the similarity based on the distance between fingerprint histograms extracted using the Local Binary Pattern (LBP). The system's last step is to translate the results into a sensible form where it utilizes fuzziness to provide more possibilities for the answer. The proposed identification system showed high efficiency on FVC 2002 and FVC 2000 databases. For instance, the results of applying our system on FVC 2002 provided a set of three ordered matching candidates such that 97.5 % of the results provided the correct candidate as the first order, and the rest of 2.5 % provided the correct candidate as the second order.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111134"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}