{"title":"Leveraging Transformer-based autoencoders for low-rank multi-view subspace clustering","authors":"Yuxiu Lin , Hui Liu , Xiao Yu , Caiming Zhang","doi":"10.1016/j.patcog.2024.111331","DOIUrl":"10.1016/j.patcog.2024.111331","url":null,"abstract":"<div><div>Deep multi-view subspace clustering is a hot research topic, aiming to integrate multi-view information to produce accurate cluster prediction. Limited by the inherent heterogeneity of distinct views, existing works primarily rely on view-specific encoding structures for representation learning. Although effective to some extent, this approach may hinder the full exploitation of view information and increase the complexity of model training. To this end, this paper proposes a novel low-rank multi-view subspace clustering method, TALMSC, backed by Transformer-based autoencoders. Specifically, we extend the self-attention mechanism to multi-view clustering settings, developing multiple Transformer-based autoencoders that allow for modality-agnostic representation learning. Based on extracted latent representations, we deploy a sample-wise weighted fusion module that incorporates contrastive learning and orthogonal operators to formulate both consistency and diversity, consequently generating a comprehensive joint representation. Moreover, TALMSC involves a highly flexible low-rank regularizer under the weighted Schatten <span><math><mi>p</mi></math></span>-norm to constrain self-expression and better explore the low-rank structure. Extensive experiments on five multi-view datasets show that our method enjoys superior clustering performance over state-of-the-art methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111331"},"PeriodicalIF":7.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143146457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting ObjectBox for accurate hand detection","authors":"Yang Yang , Jun He , Xueliang Liu , Richang Hong","doi":"10.1016/j.patcog.2024.111315","DOIUrl":"10.1016/j.patcog.2024.111315","url":null,"abstract":"<div><div>Hand detection plays a crucial role in various computer vision applications, yet it has received limited research focus in recent years, lagging behind the generic object detection. In this work, we present HandBox to address this gap. HandBox leverages the capabilities of the advanced one-stage anchor-free object detector ObjectBox for accurate hand detection, in which we first scrutinize the limitations and shortcomings of ObjectBox in localizing small objects such as hands and subsequently put forward targeted remedies to enhance its performance. Experiments on two datasets, namely the Oxford-Hand dataset and the Contact-Hand dataset, show that HandBox outperforms ObjectBox by a large margin and achieves 86.21% and 87.79% <span><math><msub><mrow><mtext>AP</mtext></mrow><mrow><mn>50</mn></mrow></msub></math></span> respectively, setting a new benchmark for hand detection. Experiments on the MSCOCO dataset also showcase that our reformed HandBox is able to achieve better performance on generic object detection against ObjectBox, especially on detecting small objects. Codes will be made public at <span><span>https://github.com/HandDetector/HandBox</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111315"},"PeriodicalIF":7.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143147636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixed hierarchy network for image restoration","authors":"Hu Gao, Ying Zhang, Jing Yang, Depeng Dang","doi":"10.1016/j.patcog.2024.111313","DOIUrl":"10.1016/j.patcog.2024.111313","url":null,"abstract":"<div><div>Image restoration is a long-standing low-level vision problem, e.g., deblurring, deraining and desnowing. In the process of image restoration, it is necessary to consider not only the spatial details and contextual information of restoration to ensure the quality but also the system complexity. Although many methods have been able to guarantee the quality of image restoration, the system complexity of the state-of-the-art (SOTA) methods is increasing as well. Motivated by this, we present a mixed hierarchy network that can balance these competing goals. Our main proposal is a mixed hierarchy architecture, that progressively recovers contextual information and spatial details from degraded images while we replace the nonlinear activation function with simple gate mechanism to reduce system complexity. Specifically, our model first learns the contextual information at the lower hierarchy using encoder–decoder architectures, and then at the higher hierarchy operates on full-resolution to retain spatial detail information. To facilitate information exchange, we design an adaptive feature fusion mechanism that selectively aggregates spatially-precise details and rich contextual information. In addition, we propose a selective multi-head attention mechanism with linear time complexity to adaptively retain the most crucial attention scores. The resulting tightly interlinked hierarchy architecture, named as MHNet, delivers strong performance gains on several image restoration tasks, including image desnowing, deraining, and deblurring. The code and the pre-trained models are released at <span><span>https://github.com/Tombs98/MHNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111313"},"PeriodicalIF":7.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143147641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RSPCA: Random Sample Partition and Clustering Approximation for ensemble learning of big data","authors":"Mohammad Sultan Mahmud , Hua Zheng , Diego Garcia-Gil , Salvador García , Joshua Zhexue Huang","doi":"10.1016/j.patcog.2024.111321","DOIUrl":"10.1016/j.patcog.2024.111321","url":null,"abstract":"<div><div>Large-scale data clustering needs an approximate approach for improving computation efficiency and data scalability. In this paper, we propose a novel method for ensemble clustering of large-scale datasets that uses the Random Sample Partition and Clustering Approximation (RSPCA) to tackle the problems of big data computing in cluster analysis. In the RSPCA computing framework, a big dataset is first partitioned into a set of disjoint random samples, called RSP data blocks that remain distributions consistent with that of the original big dataset. In ensemble clustering, a few RSP data blocks are randomly selected, and a clustering operation is performed independently on each data block to generate the clustering result of the data block. All clustering results of selected data blocks are aggregated to the ensemble result as an approximate result of the entire big dataset. To improve the robustness of the ensemble result, the ensemble clustering process can be conducted incrementally using multiple batches of selected RSP data blocks. To improve computation efficiency, we use the I-niceDP algorithm to automatically find the number of clusters in RSP data blocks and the <span><math><mi>k</mi></math></span>-means algorithm to determine more accurate cluster centroids in RSP data blocks as inputs to the ensemble process. Spectral and correlation clustering methods are used as the consensus functions to handle irregular clusters. Comprehensive experiment results on both real and synthetic datasets demonstrate that the ensemble of clustering results on a few RSP data blocks is sufficient for a good global discovery of the entire big dataset, and the new approach is computationally efficient and scalable to big data.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111321"},"PeriodicalIF":7.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143146441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visible–thermal multiple object tracking: Large-scale video dataset and progressive fusion approach","authors":"Yabin Zhu , Qianwu Wang , Chenglong Li , Jin Tang , Chengjie Gu , Zhixiang Huang","doi":"10.1016/j.patcog.2024.111330","DOIUrl":"10.1016/j.patcog.2024.111330","url":null,"abstract":"<div><div>The complementary benefits from visible and thermal infrared data are extensively utilized in various computer vision tasks, such as visual tracking and object detection, but rarely explored in Multiple Object Tracking (MOT). This paper contributes a large-scale Visible–Thermal video benchmark for MOT, named VT-MOT, which presents several key advantages. First, it comprises 582 video sequence pairs with 401,000 frame pairs collected from diverse sources, including surveillance, drone, and handheld platforms. Second, VT-MOT has dense and high-quality annotations, with 3.99 million annotation boxes verified by professionals. To provide a strong baseline, we design a simple yet effective tracking framework, which effectively fuses temporal information and complementary information of two modalities in a progressive manner, for robust visible–thermal MOT. Comprehensive experiments validate the proposed method’s superiority over existing state-of-the-art methods, while potential future research directions for visible–thermal MOT are outlined. The project is released in <span><span>https://github.com/wqw123wqw/PFTrack</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111330"},"PeriodicalIF":7.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143147572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"S2Reg: Structure-semantics collaborative point cloud registration","authors":"Zongyi Xu , Xinyu Gao , Xinqi Jiang , Shiyang Cheng , Qianni Zhang , Weisheng Li , Xinbo Gao","doi":"10.1016/j.patcog.2024.111290","DOIUrl":"10.1016/j.patcog.2024.111290","url":null,"abstract":"<div><div>Point cloud registration is one of the essential tasks in 3D vision. However, most existing methods mainly locate the point correspondences based on geometric information or adopt semantic information to filter out incorrect correspondences. They overlook the underlying correlation between semantics and structure. In this paper, we propose a structure-semantics collaborative point cloud registration method. Firstly, we propose a <strong>S</strong>uperpoint <strong>S</strong>emantic <strong>F</strong>eature <strong>R</strong>epresentation module (SSFR), which incorporates multiple semantics of neighboring points to characterize the semantics of superpoints. Then, through a <strong>S</strong>tructural and <strong>S</strong>emantic <strong>F</strong>eature corre<strong>L</strong>ation with <strong>A</strong>ttention <strong>G</strong>uidance module (S<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>FLAG), we capture the global correlation of semantics and structure within a point cloud, as well as the consistency of semantics and structure between point clouds. Moreover, an image semantic segmentation foundation model is employed to acquire semantics when images of the point clouds are available. Extensive experiments demonstrate that our method achieves superior performance, especially in low-overlap scenarios. Our code and models are available at <span><span>https://github.com/GAOXINYU203/s2reg</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111290"},"PeriodicalIF":7.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143146437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Region-aware mutual relational knowledge distillation for semantic segmentation","authors":"Haowen Zheng , Xuxin Lin , Hailun Liang , Benjia Zhou , Yanyan Liang","doi":"10.1016/j.patcog.2024.111319","DOIUrl":"10.1016/j.patcog.2024.111319","url":null,"abstract":"<div><div>Existing knowledge distillation (KD) methods for semantic segmentation predominantly focus on transferring point-level and structure-level knowledge. While point-level KD methods align features pixel by pixel, structure-level KD approaches delve into relations encompassing intra-class variation and inter-class distance. However, considering either intra-class or inter-class relations results in incomplete relational knowledge. Moreover, distilling knowledge on the entire feature maps is susceptible to interference between object and background. To address these, we propose Region-aware Mutual Relational Knowledge Distillation (RMRKD) to fully leverage both intra-class and inter-class knowledge and explore mutual relations between teacher and student. Specifically, we introduce a region-aware module to decouple intra-class and inter-class knowledge and perform pixel-wise interaction to capture their mutual relations. The module further separates the background and foreground regions for output maps to mitigate interference. Extensive experiments on three challenging benchmarks demonstrate the effectiveness of RMRKD against state-of-the-art KD approaches.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111319"},"PeriodicalIF":7.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143147637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bimodal Masked Autoencoders with internal representation connections for electrocardiogram classification","authors":"Yufeng Wei , Cheng Lian , Bingrong Xu , Pengbo Zhao , Honggang Yang , Zhigang Zeng","doi":"10.1016/j.patcog.2024.111311","DOIUrl":"10.1016/j.patcog.2024.111311","url":null,"abstract":"<div><div>Time series self-supervised methods have been widely used, with electrocardiogram (ECG) classification tasks also reaping their benefits. One mainstream paradigm is masked data modeling, which leverages the visible part of data to reconstruct the masked part, aiding in acquiring useful representations for downstream tasks. However, traditional approach predominantly attends to time domain information and places excessive demands on the encoder for reconstruction, thereby hurting model’s discriminative ability. In this paper, we present Bimodal Masked autoencoders with Internal Representation Connections (BMIRC) for ECG classification. On the one hand, BMIRC integrates the frequency spectrum of ECG into the masked pre-training process, enhancing the model’s comprehensive understanding of the ECG. On the other hand, it establishes internal representation connections (IRC) from the encoder to the decoder, which offers the decoder various levels of information to aid in reconstruction, thereby allowing the encoder to focus on modeling discriminative representations. We conduct comprehensive experiments across three distinct ECG datasets to validate the effectiveness of BMIRC. Experimental results demonstrate that BMIRC surpasses the competitive baselines across the majority of scenarios, encompassing both intra-domain (pre-training and fine-tuning on the same dataset) and cross-domain (pre-training and fine-tuning on different datasets) settings. The code is publicly available at <span><span>https://github.com/Envy-Clouds/BMIRC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111311"},"PeriodicalIF":7.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143146851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-efficient multi-scale fusion vision transformer","authors":"Hao Tang, Dawei Liu, Chengchao Shen","doi":"10.1016/j.patcog.2024.111305","DOIUrl":"10.1016/j.patcog.2024.111305","url":null,"abstract":"<div><div>Vision transformers (ViTs) excel in image classification with large datasets but struggle with smaller ones. Vanilla ViTs are single-scale, tokenizing images into patches with a single patch size. In this paper, we introduce multi-scale tokens, where multiple scales are achieved by splitting images into patches of varying sizes. Our model concatenates token sequences of multiple scales for attention, and a regional cross-scale interaction module fuses these tokens, improving data efficiency by learning local structures across scales. Additionally, we implement a data augmentation schedule to refine training. Extensive experiments on image classification demonstrate our approach surpasses DeiT by 6.6% on CIFAR100 and 1.6% on ImageNet1K. Code is available at <span><span>https://github.com/visresearch/dems</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111305"},"PeriodicalIF":7.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143146853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting unlabeled data in few-shot learning with manifold similarity and label cleaning","authors":"Michalis Lazarou , Tania Stathaki , Yannis Avrithis","doi":"10.1016/j.patcog.2024.111304","DOIUrl":"10.1016/j.patcog.2024.111304","url":null,"abstract":"<div><div>Few-shot learning investigates how to solve novel tasks given limited labeled data. Exploiting unlabeled data along with the limited labeled has shown substantial improvement in performance. In this work we propose a novel algorithm that exploits unlabeled data in order to improve the performance of few-shot learning. We focus on transductive few-shot inference, where the entire test set is available at inference time, and semi-supervised few-shot learning where unlabeled data are available and can be exploited. Our algorithm starts by leveraging the manifold structure of the labeled and unlabeled data in order to assign accurate pseudo-labels to the unlabeled data. Iteratively, it selects the most confident pseudo-labels and treats them as labeled improving the quality of pseudo-labels at every iteration. Our method surpasses or matches the state of the art results on four benchmark datasets, namely <em>mini</em>ImageNet, <em>tiered</em>ImageNet, CUB and CIFAR-FS, while being robust over feature pre-processing and the quantity of available unlabeled data. Furthermore, we investigate the setting where the unlabeled data contains data from distractor classes and propose ideas to adapt our algorithm achieving new state of the art performance in the process. Specifically, we utilize the unnormalized manifold class similarities obtained from label propagation for pseudo-label cleaning and exploit the uneven pseudo-label distribution between classes to remove noisy data. The publicly available source code can be found at <span><span>https://github.com/MichalisLazarou/iLPC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111304"},"PeriodicalIF":7.5,"publicationDate":"2024-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143146852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}