Latest Articles in Knowledge-Based Systems

Ray-decomposed and gradient-constrained NeRF for few-shot view synthesis under low-light conditions
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-11 | DOI: 10.1016/j.knosys.2025.114568
Feng Wang, Liju Yin, Yiming Qin, Xiaoning Gao, Xiangyu Tang, Hui Zhou
Abstract: Neural Radiance Fields (NeRF) have shown impressive performance in novel view synthesis, providing high-quality visual results for 3D reconstruction. However, existing NeRF-based methods often fail under extreme low-light conditions with sparse-view inputs, suffering from color distortion and degraded visual quality due to inaccurate illumination modeling and overfitting to limited views. To address these challenges, we propose R-GNeRF, a novel framework that leverages ray decomposition and gradient constraints. Specifically, we decompose sampled rays into reflective and illumination components, each modeled by an independent MLP in an unsupervised manner. A gradient constraint guides the network to learn physically plausible illumination fields, allowing the synthesis of novel views under normal lighting using only the reflective component. In addition, we introduce a view-consistency annealing strategy that adaptively adjusts the sampling sphere radius based on projection consistency across views, mitigating overfitting and improving the reconstruction of fine details in few-shot synthesis. To evaluate performance under extreme low light, we construct the 3L-P dataset using a multi-pixel photon counter (MPPC) at illuminance levels of 10⁻³ and 10⁻⁴ lux, providing challenging low-light images. Extensive experiments demonstrate that R-GNeRF consistently outperforms existing methods in low-light few-shot novel view synthesis, achieving higher visual fidelity and accurate depth reconstruction while maintaining efficient rendering.
Citations: 0
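The decomposition-plus-constraint idea in this abstract can be sketched in a few lines of PyTorch: two independent MLPs predict a reflectance and an illumination component per sample point, their product reconstructs the observed low-light color, and a gradient penalty keeps the learned illumination field smooth. This is a minimal sketch under our own assumptions (module sizes, loss weights, and names are hypothetical), not the authors' implementation.

```python
# Hypothetical sketch of ray decomposition with a gradient constraint.
import torch
import torch.nn as nn

class ComponentMLP(nn.Module):
    def __init__(self, in_dim=3, hidden=128, out_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.Sigmoid(),  # keep components in (0, 1)
        )

    def forward(self, x):
        return self.net(x)

reflectance_mlp = ComponentMLP()   # lighting-invariant component, kept at test time
illumination_mlp = ComponentMLP()  # illumination field, dropped for normal-light views

def decomposition_loss(points, target_rgb, smooth_weight=0.1):
    points = points.clone().requires_grad_(True)
    reflectance = reflectance_mlp(points)
    illumination = illumination_mlp(points)
    recon = reflectance * illumination                  # low-light color = R * L
    recon_loss = ((recon - target_rgb) ** 2).mean()
    # gradient constraint: penalize spatial variation of the illumination field
    grad = torch.autograd.grad(illumination.sum(), points, create_graph=True)[0]
    smooth_loss = grad.norm(dim=-1).mean()
    return recon_loss + smooth_weight * smooth_loss

loss = decomposition_loss(torch.rand(1024, 3), torch.rand(1024, 3))
loss.backward()
```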
Quantum entropy structural encoding for graph neural networks
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-11 | DOI: 10.1016/j.knosys.2025.114580
Feng Ding, Yingbo Wang, Shuo Yu, Yanming Shen
Abstract: Structural encoding (SE) can improve the expressive power of Graph Neural Networks (GNNs). However, current SE methods have limited expressive power because they have limitations in capturing (1) node subgraphs, (2) the global position of nodes, and (3) the global structure of the graph. To tackle this challenge, we propose a Quantum Entropy Structural Encoding (QESE) for GNNs. For limitations (1) and (3), we employ quantum entropy on node subgraphs and the whole graph to recognize highly similar structures. For limitation (2), we apply quantum entropy to the complements of node subgraphs to locate node positions. We then obtain QESE by integrating the quantum entropies of these three parts through the Holevo χ quantity. Notably, we prove that QESE always captures structural distinctions in node subgraphs and the whole graph, and that the Holevo χ quantity empowers QESE to represent the global position of nodes. We theoretically show that QESE distinguishes strongly regular graphs that 3-WL fails to, and has the potential to be more powerful than k-WL (k > 3). We adopt a plug-and-play approach to inject QESE into existing GNNs, and further design an approximated version to reduce computational complexity. Experimental results show that QESE lifts the expressive power of GNNs beyond 3-WL and indeed captures node subgraphs. Furthermore, QESE improves the performance of various GNNs on graph learning tasks and surpasses other SE methods.
Citations: 0
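The quantum entropy of a graph is commonly defined as the von Neumann entropy of its trace-normalized Laplacian treated as a density matrix. A minimal NumPy sketch of that primitive follows, assuming this standard Laplacian construction; the paper's Holevo χ integration and subgraph bookkeeping are not reproduced.

```python
# Von Neumann entropy S(rho) = -tr(rho log rho) of a graph's Laplacian
# density matrix; the basic quantity a quantum-entropy encoding builds on.
import numpy as np

def von_neumann_entropy(adj: np.ndarray) -> float:
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    rho = laplacian / np.trace(laplacian)   # density matrix: PSD, trace 1
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]      # convention: 0 log 0 = 0
    return float(-(eigvals * np.log(eigvals)).sum())

# Example: entropy of a 4-cycle
c4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)
print(von_neumann_entropy(c4))
```

Structurally similar (sub)graphs yield similar spectra and hence similar entropies, which is what lets such encodings flag highly similar structures.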
Multimodal image generation and fusion through content-style hybrid disentanglement
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-11 | DOI: 10.1016/j.knosys.2025.114597
Xu Cao, Huanxin Zou, Jun Li, Hao Chen, Xinyi Ying, Shitian He, Yingqian Wang, Liyuan Pan
Abstract: Multimodal image fusion and cross-modal translation are fundamental yet challenging tasks in computer vision, and their performance directly impacts downstream applications. Existing approaches typically treat these tasks independently, developing specialized models that fail to exploit the intrinsic relationships between different modalities. This limitation not only restricts model generalizability but also hinders further performance improvements. In this paper, we propose a joint optimization framework for image generation and fusion. Specifically, we generalize multimodal image tasks as the fusion and transformation of cross-modal features, and design a hybrid task training strategy. At the data level, we introduce a hybrid self-supervised and mutual-supervised mechanism for content-style feature decoupling, which achieves superior feature separation through stepwise training on intra-modal and cross-modal data. At the model level, we construct a triple-branch decoupling head along with fusion and transformation modules to ensure synchronous and efficient execution of both tasks. Our method not only breaks through the single-task limitation of existing models, but also innovatively introduces mixed supervision into multimodal processing. We conduct comprehensive experiments covering four modality-fusion tasks on seven popular datasets. Extensive experimental results demonstrate that our method achieves superior performance on both tasks compared with the respective state-of-the-art methods, and shows impressive cross-task generalization capability.
Citations: 0
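The underlying content-style decoupling pattern can be illustrated with a shared content encoder, per-modality style encoders, and a decoder that recombines content from one modality with style from another. All layer shapes and module names below are illustrative assumptions, not the paper's triple-branch design.

```python
# Sketch of content-style disentanglement for cross-modal translation.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

content_enc = Encoder(64)   # shared across modalities
style_enc_b = Encoder(16)   # style of modality B (e.g., infrared)
decoder = nn.Sequential(
    nn.ConvTranspose2d(64 + 16, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)

def translate_a_to_b(img_a, img_b):
    # content from A plus style from B -> B-styled rendering of A's content
    c = content_enc(img_a)
    s = style_enc_b(img_b)
    return decoder(torch.cat([c, s], dim=1))

img_a, img_b = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(translate_a_to_b(img_a, img_b).shape)  # torch.Size([1, 3, 64, 64])
```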
LQMF-RD: A lightweight quantum-driven multi-modal fusion framework for rumor detection
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-10 | DOI: 10.1016/j.knosys.2025.114633
Keliang Jia, Fanxu Meng, Ziwen Chen, Mengyao Du, Jing Liang
Abstract: In recent years, automated rumor detection has garnered significant attention. Despite notable progress in multi-modal modeling for social media rumor detection, two major challenges remain: (1) the dynamic characteristics of social networks during the propagation process are often overlooked; (2) multi-modal features (such as text, images, and propagation graphs) are often poorly aligned, leading to redundant model parameters. To address these issues, we propose LQMF-RD, a lightweight quantum-driven multi-modal feature fusion framework for rumor detection. First, to capture the dynamic nature of rumor propagation, we design a Dynamic Graph Network (DGN) that leverages the spatiotemporal characteristics of the propagation graph, effectively capturing both neighborhood dependencies and temporal evolution among nodes. Then, we employ amplitude encoding to project the extracted multi-modal features into a [log₂N]-dimensional quantum state space. Finally, we construct a Lightweight Quantum-driven Multi-modal Fusion Network (LQMFN), which enables deep interaction and fusion of multi-modal features through quantum convolution and pooling operations. LQMFN updates only 0.01M parameters, substantially reducing computational complexity and storage overhead. Experimental results show that LQMF-RD not only delivers superior performance on rumor detection tasks, but also achieves high computational efficiency and strong robustness to quantum noise.
Citations: 0
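Amplitude encoding, the step that gives the logarithmic compression mentioned above, maps an N-dimensional feature vector onto ⌈log₂N⌉ qubits by padding to the next power of two and L2-normalizing so the entries become state amplitudes. A minimal sketch, with the quantum convolution/pooling circuits of LQMFN left out:

```python
# Classical preparation step for amplitude encoding.
import numpy as np

def amplitude_encode(features: np.ndarray) -> np.ndarray:
    n = len(features)
    n_qubits = int(np.ceil(np.log2(n)))
    padded = np.zeros(2 ** n_qubits)
    padded[:n] = features
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    return padded / norm   # amplitudes of a valid quantum state (unit norm)

state = amplitude_encode(np.array([0.3, 0.5, 0.1, 0.7, 0.2]))
print(len(state), np.isclose(np.sum(state ** 2), 1.0))  # 8 (3 qubits), True
```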
Enhanced magnetic resonance imaging feature extraction for precise brain tumor classification using dual deep convolutional networks
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-10 | DOI: 10.1016/j.knosys.2025.114628
Denis Bernard, Constantino Msigwa, Jaeseok Yun
Abstract: Precise and reliable classification of brain tumors is a critical prerequisite for effective medical diagnostics and the development of targeted treatment strategies. The complex and diverse structures of brain tumors, such as their texture, size, and appearance, pose significant challenges for deep learning models, often reducing their accuracy in identifying tumors from magnetic resonance imaging scans. To tackle this challenge, we introduce the Dual Deep Convolutional Brain Tumor Network, which combines a pre-trained Visual Geometry Group 19 (VGG19) model with a custom-designed Convolutional Neural Network to extract both fine-grained and high-level tumor features. By combining these complementary feature sets, the model enhances classification accuracy and robustness, providing a comprehensive understanding of the complex brain tumor landscape. The model's effectiveness was validated through 10-fold cross-validation on the Kaggle brain tumor classification dataset, which encompasses glioma, no-tumor, meningioma, and pituitary categories. Experimental findings reveal that our model surpasses existing techniques, attaining 98.81% accuracy, 97.69% precision, 97.75% recall, 99.18% specificity, and an F1-score of 97.70%. These results confirm that the integrated model provides a reliable and accurate solution for brain tumor classification, with significant implications for clinical diagnostics and treatment planning.
Citations: 0
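The dual-branch pattern described here is straightforward to sketch: a frozen pre-trained VGG19 supplies high-level features, a small custom CNN supplies fine-grained ones, and the concatenated features feed a 4-way classifier. Layer sizes and the fusion head below are our assumptions, not the paper's exact architecture.

```python
# Sketch of dual-branch feature extraction for 4-class MRI classification.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad = False          # keep the pre-trained branch frozen

custom_cnn = nn.Sequential(          # fine-grained branch
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

classifier = nn.Sequential(          # glioma / meningioma / pituitary / no tumor
    nn.Linear(512 + 64, 128), nn.ReLU(), nn.Linear(128, 4),
)

def forward(mri_batch):                       # (B, 3, 224, 224)
    high = vgg(mri_batch).mean(dim=(2, 3))    # pooled VGG19 features -> (B, 512)
    fine = custom_cnn(mri_batch)              # (B, 64)
    return classifier(torch.cat([high, fine], dim=1))

print(forward(torch.rand(2, 3, 224, 224)).shape)  # torch.Size([2, 4])
```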
SGD-font: Style and glyph decoupling for one-shot font generation
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-10 | DOI: 10.1016/j.knosys.2025.114600
Zhenhua Li, Siyi Chen, Dong Liang
Abstract: Automatic font generation aims to produce a complete font library by learning the font style from reference samples. The task is challenging because it must generate a large set of characters with consistent style and complicated glyph structures from limited reference font images, especially when the character or font style is unseen during training. In this paper, we propose a glyph-control-based diffusion model for one-shot font generation. Specifically, we employ a style encoder to extract multi-scale style features and incorporate them into the reverse denoising steps of the diffusion model via cross-attention-based style fusion blocks. Decoupling style and glyph enables the combination of arbitrary styles and glyphs in font creation and allows users to generate fonts with unseen styles and unseen glyphs. In the inference stage, we introduce a multi-condition sampling strategy to effectively align the desired style and target glyph. Comprehensive experiments and a user study show that our framework surpasses existing approaches for both seen and unseen fonts. We further demonstrate its capability for style interpolation and cross-lingual font generation. The code is available at https://github.com/ChenSiyi1/SGD-Font.
Citations: 0
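A cross-attention style fusion block of the kind the abstract describes can be sketched compactly: denoiser features attend to style tokens extracted from the reference image, and the result is injected residually. Dimensions and naming here are assumptions for illustration; the authors' repository holds the actual implementation.

```python
# Sketch of a cross-attention style fusion block for a denoising network.
import torch
import torch.nn as nn

class StyleFusionBlock(nn.Module):
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, glyph_feats, style_tokens):
        # glyph_feats: (B, N, dim) denoiser features
        # style_tokens: (B, M, dim) multi-scale style features from the encoder
        fused, _ = self.attn(query=glyph_feats, key=style_tokens, value=style_tokens)
        return self.norm(glyph_feats + fused)   # residual injection of style

block = StyleFusionBlock()
out = block(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```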
KDP-MHL: Key data point-aware multi-scale hypergraph learning framework for multivariate time series classification
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-10 | DOI: 10.1016/j.knosys.2025.114620
Nan Ma, Jiacheng Guo, Yajue Yang, Shuling Li, Zehao Wang, Yiheng Han
Abstract: Multivariate time series classification faces inherent challenges due to complex high-order temporal correlations among data points and redundant data that obscure discriminative patterns. Existing methods primarily focus on modeling local or pairwise interactions while ignoring the distinction between informative and redundant data points. To capture the informative high-order relationships underlying multi-scale temporal patterns, we propose the Key Data Point-Aware Multi-Scale Hypergraph Learning Framework (KDP-MHL), an encoder-decoder architecture based on hypergraph neural networks. Throughout the framework, we develop a Local-Enhanced Dynamic Hypergraph Propagation Layer that extracts locally enhanced node features for each data point and obtains multi-scale high-order temporal associations by constructing dynamic hypergraphs among multiple nodes. To reduce redundancy, a Key Data Point-Aware Module is designed in the encoder to calculate node importance based on high-order attribute features and retain the key data points. In the decoder, a Multiple Class Tokens Representation method is introduced to guide high-order interactions between multiple class tokens and key data point features through the hypergraph structure, further aggregating class-specific information from selected key data points and thereby improving the representation capability. Extensive experiments on 24 UEA datasets demonstrate that our method achieves superior performance compared with state-of-the-art approaches, with a 3% improvement in average accuracy.
Citations: 0
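The primitive underlying such frameworks is hypergraph propagation: node features are gathered onto hyperedges through the incidence matrix H and scattered back to nodes. Below is a row-normalized variant of the generic hypergraph convolution X' = σ(Dᵥ⁻¹ H Dₑ⁻¹ Hᵀ X W), a minimal sketch rather than the authors' exact layer.

```python
# One generic hypergraph propagation step (degree-normalized).
import torch

def hypergraph_conv(x, H, weight):
    # x: (n_nodes, d_in) features, H: (n_nodes, n_edges) incidence matrix,
    # weight: (d_in, d_out) learnable projection
    d_v = H.sum(dim=1).clamp(min=1)                    # node degrees
    d_e = H.sum(dim=0).clamp(min=1)                    # hyperedge degrees
    edge_feats = (H.T @ x) / d_e.unsqueeze(1)          # gather nodes -> hyperedges
    node_feats = (H @ edge_feats) / d_v.unsqueeze(1)   # scatter back to nodes
    return torch.relu(node_feats @ weight)

H = torch.tensor([[1., 0.],
                  [1., 1.],
                  [0., 1.],
                  [0., 1.]])            # 4 nodes, 2 hyperedges
x = torch.randn(4, 8)
w = torch.randn(8, 16)
print(hypergraph_conv(x, H, w).shape)   # torch.Size([4, 16])
```

Because a hyperedge can connect any number of nodes, one propagation step captures the high-order (beyond pairwise) associations that ordinary graph convolutions miss.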
Statistical matching using autoencoders-canonical correlation analysis, kernel canonical correlation analysis and multi-output multilayer perceptron
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-10 | DOI: 10.1016/j.knosys.2025.114626
Hugues Annoye, Alessandro Beretta, Cédric Heuchenne
Abstract: Large volumes of data are gathered every day, via surveys and other sources. Analyses often require variables from different data sources, which creates the need for methods to combine them. A recognized practice for combining data sets in this field is statistical matching. In this paper, we investigate and extend to statistical matching an Autoencoders-Canonical Correlation Analysis (A-CCA). A-CCA is an extension of KCCA that reduces the need for kernels, with the added benefit of dimensionality reduction. It can be regarded as an extension of Deep Canonical Correlation Analysis (DCCA), providing enhanced flexibility that makes it well suited for statistical matching. The method is designed to deal with various variable types, sampling weights, and incompatibilities among categorical variables. We compare its performance with that of methods based on Kernel Canonical Correlation Analysis (KCCA) and the Multi-output Multilayer Perceptron (MMLP), using the 2017 Belgian Statistics on Income and Living Conditions (SILC). We divide this data set into two parts and treat them as if they came from two different sources.
Citations: 0
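The canonical-correlation primitive that A-CCA and KCCA build on can be shown with linear CCA: project two views (two survey files sharing common variables) into a shared space and match records by nearest neighbours there. A small sketch on synthetic data, using scikit-learn's linear CCA; the autoencoder and kernel extensions are not shown.

```python
# Statistical matching via a shared CCA space (synthetic data).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                                  # shared structure
view_a = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
view_b = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))

cca = CCA(n_components=2)
za, zb = cca.fit_transform(view_a, view_b)   # maximally correlated projections

# match each record of file A to its nearest neighbour in file B's projection
dists = ((za[:, None, :] - zb[None, :, :]) ** 2).sum(-1)
matches = dists.argmin(axis=1)
print("self-match rate:", (matches == np.arange(500)).mean())
```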
Enhanced deep learning and quantum variational classifier for large-scale data analysis
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-10 | DOI: 10.1016/j.knosys.2025.114611
Sudha D, Anju A, Ezhilarasi K
Abstract: Quantum machine learning (QML) offers a way to analyze vast volumes of health data, identify possible higher-order interactions in medicine, and improve the accuracy of diagnosis and treatment in smart healthcare. This paper presents a novel hybrid framework that integrates an Inception-based Attentional VGG (IAV) with a Quantum Variational Classifier (QVC) and Parameterized Quantum Circuits (PQCs) for large-scale healthcare data analysis. Unlike existing models, which face scalability issues, noise sensitivity, and high computational cost, the proposed approach combines deep-learning feature extraction with quantum-enhanced classification to improve efficiency and accuracy. The large-scale data are first pre-processed with min-max normalization, which maps feature values into a fixed range and facilitates training convergence. The Inception-based Attentional VGG then extracts features from the pre-processed data, and the quantum variational classifier categorizes them. The parameterized quantum circuits use a classical optimizer to tune the parameters of the quantum functions based on quantum measurements. The model is evaluated on the MIMIC-III clinical dataset, a large collection of clinical patient data, using accuracy, precision, recall, and F1-score. Experimental results show that the proposed approach achieves an accuracy of 98.76%, precision of 98.64%, recall of 98.12%, and F1-score of 98.86%, outperforming existing models such as SVM (89.23% accuracy), QSVM (90.13%), and QVKSVM (97.34%). These results demonstrate that the proposed hybrid QML-DL framework effectively handles high-dimensional clinical data, reduces computational overhead, and provides a strong foundation for next-generation healthcare analytics.
Citations: 0
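The classical-optimizer-tunes-quantum-circuit loop can be illustrated at toy scale: min-max normalize a feature, encode it as a qubit rotation angle, apply one trainable rotation, and read the Pauli-Z expectation as the class score, with gradients from the parameter-shift rule. This single-qubit NumPy simulation is purely illustrative; a real PQC would run on a quantum simulator or hardware stack.

```python
# Toy variational quantum classifier, simulated with plain NumPy.
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())        # the paper's preprocessing step

def ry(theta):                                        # single-qubit Y rotation
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]])

def predict(feature, weight):
    # encode feature, apply trainable rotation, measure <Z> in [-1, 1]
    state = ry(np.pi * feature) @ ry(weight) @ np.array([1.0, 0.0])
    return state[0] ** 2 - state[1] ** 2

x = minmax(np.linspace(0.0, 10.0, 50))
y = np.where(x < 0.5, 1.0, -1.0)                      # toy binary labels
w = 0.1
for _ in range(200):
    # parameter-shift rule gives the exact circuit gradient
    grad = np.mean([(predict(xi, w + np.pi / 2) - predict(xi, w - np.pi / 2)) / 2
                    * 2 * (predict(xi, w) - yi) for xi, yi in zip(x, y)])
    w -= 0.1 * grad                                   # classical optimizer step
print("trained weight:", w)
```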
Not all patches are crucial to image recognition: Window patch clustering attention for transformers
IF 7.6 | CAS Tier 1 | Computer Science
Knowledge-Based Systems | Pub Date: 2025-10-10 | DOI: 10.1016/j.knosys.2025.114647
Ruoyu Wu, Yue Wang, Dongguang Li, Jintao Liu
Abstract: The Vision Transformer (ViT) effectively captures global and local image features by connecting image patches and facilitating information transfer between them, making it an essential tool in computer vision. However, its computational cost has been a major limiting factor for its application. To reduce the cost introduced by the attention mechanism in the transformer architecture, researchers have explored two approaches: reducing the number of patches involved in the computation and designing new attention mechanisms. Although these methods improve efficiency, they require manual preprocessing and additional model training compared to ViT, which limits their flexibility. In this work, we propose an adaptive attention pattern for vision transformers that is easily implemented within the transformer architecture, and we design a novel window transformer architecture that handles various vision tasks without any preprocessing or additional model training. Our method determines which patches participate in the self-attention calculation based on the similarity of image patches in multidimensional space, thereby reducing the computational cost of these calculations. Experimental results show that our method is more effective, with fewer patches involved in the attention calculation than in window attention architectures that do not incorporate the proposed attention block. Furthermore, to better understand the relationship between the transformer architecture and the input patches, we investigated the impact of different image patches on the performance of transformer-based networks. We found that for typical window transformer architectures, only a subset of patches is crucial for accurate object recognition, while the other patches primarily contribute to the confidence of the predictions.
Citations: 0
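Similarity-based patch selection of the kind described can be sketched as follows: compare each window patch to the window's mean token and keep only the most distinctive patches for self-attention, letting the rest skip the quadratic computation. The threshold and selection rule here are our assumptions, not the paper's exact clustering criterion.

```python
# Sketch of similarity-based patch selection within an attention window.
import torch
import torch.nn.functional as F

def select_patches(tokens, keep_ratio=0.5):
    # tokens: (B, N, D) patch embeddings within one window
    centroid = tokens.mean(dim=1, keepdim=True)              # (B, 1, D)
    sim = F.cosine_similarity(tokens, centroid, dim=-1)      # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    # keep the k patches least similar to the centroid (most distinctive)
    idx = sim.topk(k, dim=1, largest=False).indices          # (B, k)
    return torch.gather(tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

tokens = torch.randn(2, 49, 96)     # a 7x7 window of 96-dim patch embeddings
kept = select_patches(tokens)
print(kept.shape)                   # torch.Size([2, 24, 96])
```

Since self-attention costs O(N²) in the number of patches, halving the participating patches within each window cuts the attention computation roughly fourfold.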