{"title":"Not all patches are crucial to image recognition: Window patch clustering attention for transformers","authors":"Ruoyu Wu, Yue Wang, Dongguang Li, Jintao Liu","doi":"10.1016/j.knosys.2025.114647","DOIUrl":null,"url":null,"abstract":"<div><div>Vision Transformer (VIT) effectively captures global and local image features by connecting and facilitating information transfer between image patches, making it an essential tool in computer vision. However, its computational cost has been a major limiting factor for its application. To reduce the computational cost introduced by the attention mechanism in the transformer architecture, researchers have explored two approaches: reducing the number of patches involved in the computation and innovating attention mechanisms. Although these methods have improved efficiency, they require manual preprocessing and additional model training compared to VIT, which limits their flexibility. In this work, we propose an adaptive attention pattern for vision transformers that is easily implemented within transformer architecture, and we designed a novel window transformer architecture for various vision tasks without any preprocessing or additional model training. Our method can determine which patches participate in self-attention calculations based on the similarity of image patches in multidimensional space, thereby reducing the computational cost of these calculations. Experimental results show that our method is more effective, with fewer patches involved in the attention calculation compared to the window attention architectures that do not incorporate the proposed attention block. Furthermore, to better understand the relationship between the transformer architecture and the input patches, we investigated the impact of different patches in images on the performance of transformer-based networks. We found that for typical window transformer architecture networks, only a subset of patches is crucial for accurate object recognition, while other patches primarily contribute to the confidence of the predictions.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"330 ","pages":"Article 114647"},"PeriodicalIF":7.6000,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125016867","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
The Vision Transformer (ViT) effectively captures global and local image features by connecting image patches and facilitating information transfer between them, making it an essential tool in computer vision. However, its computational cost has been a major factor limiting its application. To reduce the computational cost introduced by the attention mechanism in the transformer architecture, researchers have explored two approaches: reducing the number of patches involved in the computation and designing new attention mechanisms. Although these methods improve efficiency, they require manual preprocessing and additional model training compared to ViT, which limits their flexibility. In this work, we propose an adaptive attention pattern for vision transformers that is easily implemented within the transformer architecture, and we design a novel window transformer architecture for various vision tasks that requires no preprocessing or additional model training. Our method determines which patches participate in self-attention calculations based on the similarity of image patches in a multidimensional feature space, thereby reducing the computational cost of these calculations. Experimental results show that our method is more effective, involving fewer patches in the attention calculation than window attention architectures that do not incorporate the proposed attention block. Furthermore, to better understand the relationship between the transformer architecture and the input patches, we investigate the impact of different image patches on the performance of transformer-based networks. We find that for typical window transformer architectures, only a subset of patches is crucial for accurate object recognition, while the remaining patches primarily contribute to the confidence of the predictions.
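The central idea of the abstract, selecting which patches within a window enter the self-attention computation based on their similarity in feature space, can be illustrated with a minimal PyTorch sketch. Everything here is an assumption for illustration: the class name `SimilarityGatedWindowAttention`, the centroid-based cosine-similarity test, and the `sim_threshold` value are not the authors' actual window patch clustering attention block, which the abstract does not specify in detail.

```python
# Illustrative sketch only: a window-attention block that drops highly redundant
# patches before self-attention, based on cosine similarity of patch tokens.
# The selection rule (similarity to the window centroid) is a hypothetical stand-in
# for the clustering-based selection described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityGatedWindowAttention(nn.Module):
    """Hypothetical window attention that skips near-duplicate patches."""

    def __init__(self, dim: int, num_heads: int = 4, sim_threshold: float = 0.95):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sim_threshold = sim_threshold  # assumed cut-off for "redundant" patches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows, patches_per_window, dim) -- windowed patch tokens of one image
        out = x.clone()
        for w in range(x.size(0)):
            tokens = x[w]                                          # (P, dim)
            # Similarity of each patch token to the window centroid.
            sim = F.cosine_similarity(tokens, tokens.mean(dim=0, keepdim=True), dim=-1)
            # Keep only "distinctive" patches; always keep at least one token.
            keep = sim < self.sim_threshold
            if not keep.any():
                keep[0] = True
            sel = tokens[keep].unsqueeze(0)                        # (1, P_kept, dim)
            attn_out, _ = self.attn(sel, sel, sel)                 # attention over kept patches only
            out[w, keep] = attn_out.squeeze(0)                     # skipped patches pass through unchanged
        return out


if __name__ == "__main__":
    # Toy usage: 4 windows of 16 patches, 64-dimensional tokens.
    x = torch.randn(4, 16, 64)
    block = SimilarityGatedWindowAttention(dim=64)
    print(block(x).shape)  # torch.Size([4, 16, 64])
```

The sketch only shows where a similarity-based gate would sit relative to standard window attention; the cost saving comes from running attention over `P_kept <= P` tokens per window.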
About the journal
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.