{"title":"UniHDSA: A unified relation prediction approach for hierarchical document structure analysis","authors":"Jiawei Wang , Kai Hu , Qiang Huo","doi":"10.1016/j.patcog.2025.111617","DOIUrl":"10.1016/j.patcog.2025.111617","url":null,"abstract":"<div><div>Document structure analysis, aka document layout analysis, is crucial for understanding both the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. Hierarchical Document Structure Analysis (HDSA) specifically aims to restore the hierarchical structure of documents created using authoring software with hierarchical schemas. Previous research has primarily followed two approaches: one focuses on tackling specific subtasks of HDSA in isolation, such as table detection or reading order prediction, while the other adopts a unified framework that uses multiple branches or modules, each designed to address a distinct task. In this work, we propose a unified relation prediction approach for HDSA, called UniHDSA, which treats various HDSA sub-tasks as relation prediction problems and consolidates relation prediction labels into a unified label space. This allows a single relation prediction module to handle multiple tasks simultaneously, whether at a page-level or document-level structure analysis. By doing so, our approach significantly reduces the risk of cascading errors and enhances system’s efficiency, scalability, and adaptability. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on a hierarchical document structure analysis benchmark, Comp-HRDoc, and competitive results on a large-scale document layout analysis dataset, DocLayNet, effectively illustrating the superiority of our method across all sub-tasks.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111617"},"PeriodicalIF":7.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video summarization with temporal-channel visual transformer","authors":"Xiaoyan Tian , Ye Jin , Zhao Zhang , Peng Liu , Xianglong Tang","doi":"10.1016/j.patcog.2025.111631","DOIUrl":"10.1016/j.patcog.2025.111631","url":null,"abstract":"<div><div>Video summarization task has gained widespread interest, benefiting from its valuable capabilities for efficient video browsing. Existing approaches generally focus on inter-frame temporal correlations, which may not be sufficient to identify crucial content because of the limited useful details that can be gleaned. To resolve these issues, we propose a novel transformer-based approach for video summarization, called Temporal-Channel Visual Transformer (TCVT). The proposed TCVT consists of three components, including a dual-stream embedding module, an inter-frame encoder, and an intra-segment encoder. The dual-stream embedding module creates the fusion embedding sequence by extracting visual features and short-range optical features, preserving appearance and motion details. The temporal-channel inter-frame correlations are learned by the inter-frame encoder with multiple temporal and channel attention modules. Meanwhile, the intra-segment representations are captured by the intra-segment encoder for the local temporal context modeling. Finally, we fuse the frame-level and segment-level representations for the frame-wise importance score prediction. Our network outperforms state-of-the-art methods on two benchmark datasets, with improvements from 55.3% to 56.9% on the SumMe dataset and from 69.3% to 70.4% on the TVSum dataset.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111631"},"PeriodicalIF":7.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143735020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual fidelity and full-scale interaction driven network for infrared and visible image fusion","authors":"Liye Mei , Xinglong Hu , Zhaoyi Ye , Zhiwei Ye , Chuan Xu , Sheng Liu , Cheng Lei","doi":"10.1016/j.patcog.2025.111612","DOIUrl":"10.1016/j.patcog.2025.111612","url":null,"abstract":"<div><div>The objective of infrared and visible image fusion is to combine the unique strengths of source images into a single image that serves human visual perception and machine detection. The existing fusion networks are still lacking in the effective characterization and retention of source image features. To counter these deficiencies, we propose a visual fidelity and full-scale interaction driven network for infrared and visible image fusion, named VFFusion. First, a multi-scale feature encoder based on BiFormer is constructed, and a feature cascade interaction module is designed to perform full-scale interaction on features distributed across different scales. In addition, a visual fidelity branch is built to process multi-scale features in parallel with the fusion branch. Specifically, the visual fidelity branch uses blurred images for self-supervised training in the constructed auxiliary task, thereby obtaining an effective representation of the source image information. By exploring the complementary representational features of infrared and visible images as supervisory information, it constrains the fusion branch to retain the source image features in the fused image. Notably, the visual fidelity branch employs a multi-scale joint reconstruction loss, utilizing the rich supervisory signals provided by multi-scale original images to enhance the feature representation of targets at different scales, resulting in clear fusion of the targets. Extensive qualitative and quantitative comparative experiments are conducted on four datasets against nine advanced methods, demonstrating the superiority of our approach. The source code is available at <span><span>https://github.com/XingLongH/VFFusion</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111612"},"PeriodicalIF":7.5,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bidirectional trained tree-structured decoder for Handwritten Mathematical Expression Recognition","authors":"Hanbo Cheng , Chenyu Liu , Pengfei Hu , Zhenrong Zhang , Jiefeng Ma , Jun Du","doi":"10.1016/j.patcog.2025.111599","DOIUrl":"10.1016/j.patcog.2025.111599","url":null,"abstract":"<div><div>The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of Optical Character Recognition (OCR). Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models. However, existing methods fail to effectively utilize bidirectional context information during the inference stage. Furthermore, current bidirectional training methods are primarily designed for string decoders and cannot adequately generalize to tree decoders, which offer superior generalization capabilities and structural analysis capacity. To overcome these limitations, we propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure. Our method extends the bidirectional training strategy to the tree decoder, enabling more effective training by leveraging bidirectional information. Additionally, we analyze the impact of the visual and linguistic perception of the HMER model separately and introduce the Shared Language Modeling (SLM) mechanism. Through the SLM, we enhance the model’s robustness and generalization when dealing with visual ambiguity, especially in scenarios with abundant training data. Our approach has been validated through extensive experiments, demonstrating its ability to achieve new state-of-the-art results on the CROHME 2014, 2016, and 2019 datasets, as well as the HME100K dataset. The code used in our experiments will be publicly available at <span><span>https://github.com/Hanbo-Cheng/BAT.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111599"},"PeriodicalIF":7.5,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning hyperspectral noisy label with global and local hypergraph laplacian energy","authors":"Cheng Shi , Linfeng Lu , Minghua Zhao , Xinhong Hei , Chi-Man Pun , Qiguang Miao","doi":"10.1016/j.patcog.2025.111606","DOIUrl":"10.1016/j.patcog.2025.111606","url":null,"abstract":"<div><div>Deep learning has achieved significant advancements in hyperspectral image (HSI) classification, yet it is highly dependent on the availability of high-quality labeled data. However, acquiring such labeled data for HSIs is often challenging due to the associated high costs and complexity. Consequently, the issue of classifying HSIs with noisy labels has garnered increasing attention. To address the negative effects of noisy labels, various methods have employed label correction strategies and have demonstrated promising results. Nevertheless, these techniques typically rely on correcting labels based on small-loss samples or neighborhood similarity. In high-noise environments, such methods often face unstable training processes, and the unreliability of neighborhood samples restricts their effectiveness. To overcome these limitations, this paper proposes a label correction method designed to address noisy labels in HSI classification by leveraging both global and local hypergraph structures to estimate label confidence and correct mislabeled samples. In contrast to traditional graph-based approaches, hypergraphs are capable of capturing higher-order relationships among samples, thereby improving the accuracy of label correction. The proposed method minimizes both global and local hypergraph Laplacian energies to enhance label consistency and accuracy across the dataset. Furthermore, contrastive learning and the Mixup technique are integrated to bolster the robustness and discriminative capabilities of HSI classification networks. Extensive experiments conducted on four publicly available hyperspectral datasets — University of Pavia (UP), Salinas Valley (SV), Kennedy Space Center (KSC), and WHU-Hi-HanChuan (HC) — demonstrate the superior performance of the proposed method, particularly in scenarios characterized by high levels of noise, where substantial improvements in classification accuracy are observed.methods. The code is available at <span><span>https://github.com/AAAA-CS/GLHLE</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111606"},"PeriodicalIF":7.5,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic accumulated attention map for interpreting evolution of decision-making in vision transformer","authors":"Yi Liao , Yongsheng Gao , Weichuan Zhang","doi":"10.1016/j.patcog.2025.111607","DOIUrl":"10.1016/j.patcog.2025.111607","url":null,"abstract":"<div><div>Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods cannot display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wisely accumulating each attention map. The contribution of this work focuses on visualizing the evolution dynamic of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analysis consistently validate the effectiveness and superior capacity of the proposed DAAM for not only interpreting ViT models with the fully-connected (FC) layers as the classifier but also self-supervised ViT models. The code is available at <span><span>https://github.com/ly9802/DynamicAccumulatedAttentionMap</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111607"},"PeriodicalIF":7.5,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143815697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cascade residual learning based adaptive feature aggregation for light field super-resolution","authors":"Hao Zhang , Wenhui Zhou , Lili Lin , Andrew Lumsdaine","doi":"10.1016/j.patcog.2025.111616","DOIUrl":"10.1016/j.patcog.2025.111616","url":null,"abstract":"<div><div>Light field (LF) super-resolution aims to enhance the spatial or angular resolutions of LF images. Most existing methods tend to decompose 4D LF images into multiple 2D subspaces such as spatial, angular, and epipolar plane image (EPI) domains, and devote efforts to designing various feature extractors for each subspace domain. However, it remains challenging to select an effective multi-domain feature fusion strategy, including the fusion order and structure. To this end, this paper proposes an adaptive feature aggregation framework based on cascade residual learning, which can adaptively select feature aggregation strategies through learning rather than designed artificially. Specifically, we first employ three types of 2D feature extractors for spatial, angular, and EPI feature extraction, respectively. Then, an adaptive feature aggregation (AFA) module is designed to cascade these feature extractors through multi-level residual connections. This design enables the network to flexibly aggregate various subspace features without introducing additional parameters. We conduct comprehensive experiments on both real-world and synthetic LF datasets for light field spatial super-resolution (LFSSR) and light field angular super-resolution (LFASR). Quantitative and visual comparisons demonstrate that our model achieves state-of-the-art super-resolution (SR) performance. The code is available at <span><span>https://github.com/haozhang25/AFA-LFSR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111616"},"PeriodicalIF":7.5,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143697607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Class activation map guided level sets for weakly supervised semantic segmentation","authors":"Yifan Wang , Gerald Schaefer , Xiyao Liu , Jing Dong , Linglin Jing , Ye Wei , Xianghua Xie , Hui Fang","doi":"10.1016/j.patcog.2025.111566","DOIUrl":"10.1016/j.patcog.2025.111566","url":null,"abstract":"<div><div>Weakly supervised semantic segmentation (WSSS) aims to achieve pixel-level fine-grained image segmentation using only weak guidance such as image-level class labels, thus significantly decreasing annotation costs. Despite the impressive performance showcased by current state-of-the-art WSSS approaches, the lack of precise object localisation limits their segmentation accuracy, especially for pixels close to object boundaries. To address this issue, we propose a novel class activation map (CAM)-based level set method to effectively improve the quality of pseudo-labels by exploring the capability of level sets to enhance the segmentation accuracy at object boundaries. To speed up the level set evolution process, we use Fourier neural operators to simulate the dynamic evolution of our level set method. Extensive experimental results show that our approach significantly outperforms existing WSSS methods on both PASCAL VOC 2012 and MS COCO datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111566"},"PeriodicalIF":7.5,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143768430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FILP-3D: Enhancing 3D few-shot class-incremental learning with pre-trained vision-language models","authors":"Wan Xu , Tianyu Huang , Tianyuan Qu , Guanglei Yang , Yiwen Guo , Wangmeng Zuo","doi":"10.1016/j.patcog.2025.111558","DOIUrl":"10.1016/j.patcog.2025.111558","url":null,"abstract":"<div><div>Few-shot class-incremental learning (FSCIL) aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data. However, many of these works lack effective exploration of prior knowledge, rendering them unable to effectively address the domain gap issue in the context of 3D FSCIL, thereby leading to catastrophic forgetting. The Contrastive Vision-Language Pre-Training (CLIP) model serves as a highly suitable backbone for addressing the challenges of 3D FSCIL due to its abundant shape-related prior knowledge. Unfortunately, its direct application to 3D FSCIL still faces the incompatibility between 3D data representation and the 2D features, primarily manifested as feature space misalignment and significant noise. To address the above challenges, we introduce the FILP-3D framework with two novel components: the Redundant Feature Eliminator (RFE) for feature space misalignment and the Spatial Noise Compensator (SNC) for significant noise. RFE aligns the feature spaces of input point clouds and their embeddings by performing a unique dimensionality reduction on the feature space of pre-trained models (PTMs), effectively eliminating redundant information without compromising semantic integrity. On the other hand, SNC is a graph-based 3D model designed to capture robust geometric information within point clouds, thereby augmenting the knowledge lost due to projection, particularly when processing real-world scanned data. Moreover, traditional accuracy metrics are proven to be biased due to the imbalance in existing 3D datasets. Therefore we propose 3D FSCIL benchmark FSCIL3D-XL and novel evaluation metrics that offer a more nuanced assessment of a 3D FSCIL model. Experimental results on both established and our proposed benchmarks demonstrate that our approach significantly outperforms existing state-of-the-art methods. Code is available at: <span><span>https://github.com/HIT-leaderone/FILP-3D</span><svg><path></path></svg></span></div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111558"},"PeriodicalIF":7.5,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143714776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Asymmetric Supervised Discrete Cross-Modal Hashing for Streaming Multimedia Data","authors":"Fan Yang, Xinqi Liu, Fumin Ma, Xiaojian Ding, Kaixiang Wang","doi":"10.1016/j.patcog.2025.111604","DOIUrl":"10.1016/j.patcog.2025.111604","url":null,"abstract":"<div><div>Cross-modal online hashing, which uses freshly received data to retrain the hash function gradually, has become a research hotspot as a means of handling the massive amounts of streaming data that have been brought about by the fast growth of multimedia technology and the popularity of portable devices. However, in the process of processing stream data in most methods, on the one hand, the relationship between modal classes and the common features between label vectors and binary codes is not fully explored. On the other hand, the semantic information in the old and new data modes is not fully utilized. In this post, we offer Online Asymmetric Supervised Discrete Cross-Modal Hashing for Streaming Multimedia Data (OASCH) as a solution. This study integrates the concept cognition mechanism of dynamic incremental samples and an asymmetric knowledge guidance mechanism into the online hash learning framework. The proposed algorithmic model takes into account the knowledge similarity between newly arriving data and the existing dataset, as well as the knowledge similarity within the new data itself. It projects the hash codes associated with new incoming sample data into the potential space of concept cognition. By doing so, the model maximizes the mining of implicit semantic similarities within streaming data across different time points, resulting in the generation of compact hash codes with enhanced discriminative power, we further propose an adaptive edge regression strategy. Our method surpasses several current sophisticated cross-modal hashing techniques regarding both retrieval efficiency and search accuracy, according to studies on three publicly available multimedia retrieval datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111604"},"PeriodicalIF":7.5,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143697015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}