Not all patches are crucial to image recognition: Window patch clustering attention for transformers

Impact Factor 7.6 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Ruoyu Wu, Yue Wang, Dongguang Li, Jintao Liu
{"title":"并非所有的斑块都对图像识别至关重要:窗口斑块聚类注意变压器","authors":"Ruoyu Wu,&nbsp;Yue Wang,&nbsp;Dongguang Li,&nbsp;Jintao Liu","doi":"10.1016/j.knosys.2025.114647","DOIUrl":null,"url":null,"abstract":"<div><div>Vision Transformer (VIT) effectively captures global and local image features by connecting and facilitating information transfer between image patches, making it an essential tool in computer vision. However, its computational cost has been a major limiting factor for its application. To reduce the computational cost introduced by the attention mechanism in the transformer architecture, researchers have explored two approaches: reducing the number of patches involved in the computation and innovating attention mechanisms. Although these methods have improved efficiency, they require manual preprocessing and additional model training compared to VIT, which limits their flexibility. In this work, we propose an adaptive attention pattern for vision transformers that is easily implemented within transformer architecture, and we designed a novel window transformer architecture for various vision tasks without any preprocessing or additional model training. Our method can determine which patches participate in self-attention calculations based on the similarity of image patches in multidimensional space, thereby reducing the computational cost of these calculations. Experimental results show that our method is more effective, with fewer patches involved in the attention calculation compared to the window attention architectures that do not incorporate the proposed attention block. Furthermore, to better understand the relationship between the transformer architecture and the input patches, we investigated the impact of different patches in images on the performance of transformer-based networks. We found that for typical window transformer architecture networks, only a subset of patches is crucial for accurate object recognition, while other patches primarily contribute to the confidence of the predictions.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"330 ","pages":"Article 114647"},"PeriodicalIF":7.6000,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Not all patches are crucial to image recognition: Window patch clustering attention for transformers\",\"authors\":\"Ruoyu Wu,&nbsp;Yue Wang,&nbsp;Dongguang Li,&nbsp;Jintao Liu\",\"doi\":\"10.1016/j.knosys.2025.114647\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Vision Transformer (VIT) effectively captures global and local image features by connecting and facilitating information transfer between image patches, making it an essential tool in computer vision. However, its computational cost has been a major limiting factor for its application. To reduce the computational cost introduced by the attention mechanism in the transformer architecture, researchers have explored two approaches: reducing the number of patches involved in the computation and innovating attention mechanisms. Although these methods have improved efficiency, they require manual preprocessing and additional model training compared to VIT, which limits their flexibility. 
In this work, we propose an adaptive attention pattern for vision transformers that is easily implemented within transformer architecture, and we designed a novel window transformer architecture for various vision tasks without any preprocessing or additional model training. Our method can determine which patches participate in self-attention calculations based on the similarity of image patches in multidimensional space, thereby reducing the computational cost of these calculations. Experimental results show that our method is more effective, with fewer patches involved in the attention calculation compared to the window attention architectures that do not incorporate the proposed attention block. Furthermore, to better understand the relationship between the transformer architecture and the input patches, we investigated the impact of different patches in images on the performance of transformer-based networks. We found that for typical window transformer architecture networks, only a subset of patches is crucial for accurate object recognition, while other patches primarily contribute to the confidence of the predictions.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"330 \",\"pages\":\"Article 114647\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125016867\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125016867","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Vision Transformer (ViT) effectively captures global and local image features by connecting image patches and facilitating information transfer between them, making it an essential tool in computer vision. However, its computational cost has been a major limiting factor in its application. To reduce the computational cost introduced by the attention mechanism in the transformer architecture, researchers have explored two approaches: reducing the number of patches involved in the computation and designing new attention mechanisms. Although these methods improve efficiency, they require manual preprocessing and additional model training compared to ViT, which limits their flexibility. In this work, we propose an adaptive attention pattern for vision transformers that is easily implemented within the transformer architecture, and we design a novel window transformer architecture for various vision tasks that requires no preprocessing or additional model training. Our method determines which patches participate in the self-attention computation based on the similarity of image patches in multidimensional space, thereby reducing the computational cost of these calculations. Experimental results show that our method is more effective, involving fewer patches in the attention computation than window-attention architectures that do not incorporate the proposed attention block. Furthermore, to better understand the relationship between the transformer architecture and the input patches, we investigated the impact of different image patches on the performance of transformer-based networks. We found that for typical window transformer networks, only a subset of patches is crucial for accurate object recognition, while the remaining patches primarily contribute to the confidence of the predictions.
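The abstract does not spell out the mechanism, but the stated idea, letting patch similarity decide which tokens inside a window enter the self-attention computation, can be sketched as follows. This is a minimal PyTorch illustration under assumptions, not the authors' implementation: the cosine-similarity criterion, the greedy grouping, the `threshold` value, and the broadcasting of each representative's output back to its cluster members are all choices made here for clarity.

```python
# Minimal sketch of similarity-based patch selection inside one attention window.
# Hypothetical: cosine similarity, greedy grouping, and the threshold are
# assumptions for illustration, not details taken from the paper.
import torch
import torch.nn.functional as F


def cluster_window_patches(x: torch.Tensor, threshold: float = 0.9):
    """Greedily group near-duplicate patch tokens of a single window.

    x: (N, C) tokens. Returns the indices of one representative per cluster
    and, for every token, the original index of its cluster representative.
    """
    n = x.size(0)
    sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)  # (N, N)
    rep_of = torch.full((n,), -1, dtype=torch.long)
    reps = []
    for i in range(n):
        if rep_of[i] >= 0:  # already absorbed into an earlier cluster
            continue
        reps.append(i)
        members = (sim[i] >= threshold) & (rep_of < 0)
        rep_of[members] = i  # token i represents all sufficiently similar tokens
    return torch.tensor(reps), rep_of


def clustered_window_attention(x: torch.Tensor, attn_module, threshold: float = 0.9):
    """Attend only over cluster representatives, then broadcast the result back.

    x: (N, C) tokens of one window; attn_module maps (1, M, C) -> (1, M, C).
    """
    reps, rep_of = cluster_window_patches(x, threshold)
    y_reps = attn_module(x[reps].unsqueeze(0)).squeeze(0)  # (M, C) with M <= N
    # Each original token reuses the output of its cluster representative.
    pos = {int(r): k for k, r in enumerate(reps)}
    idx = torch.tensor([pos[int(r)] for r in rep_of])
    return y_reps[idx]
```

For a quick check, `torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)` wrapped as `lambda t: mha(t, t, t, need_weights=False)[0]` can stand in for `attn_module`. The fewer distinct clusters a window contains, the smaller the quadratic attention matrix becomes, which is where the claimed cost reduction would come from.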
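Similarly, the patch-importance analysis reported at the end of the abstract can be approximated with a simple occlusion probe: mask one patch at a time and record whether the predicted label flips (a crucial patch) or only the confidence drops. The zero-masking strategy and the fixed patch grid below are assumptions made for illustration, not the authors' protocol.

```python
# Hypothetical patch-importance probe: occlude one patch at a time and compare
# the effect on the predicted class versus the prediction confidence.
import torch


@torch.no_grad()
def patch_importance(model, image: torch.Tensor, patch: int = 16):
    """image: (1, 3, H, W); model returns logits of shape (1, num_classes)."""
    probs = model(image).softmax(dim=-1)
    pred = probs.argmax(dim=-1)
    base_conf = probs[0, pred].item()

    h, w = image.shape[-2] // patch, image.shape[-1] // patch
    flips, conf_drops = torch.zeros(h, w), torch.zeros(h, w)
    for i in range(h):
        for j in range(w):
            occluded = image.clone()
            occluded[..., i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0
            p = model(occluded).softmax(dim=-1)
            flips[i, j] = float(p.argmax(dim=-1) != pred)      # label changed?
            conf_drops[i, j] = base_conf - p[0, pred].item()   # confidence lost
    return flips, conf_drops
```

Patches whose occlusion flips the label would correspond to the crucial subset the authors describe, while patches with zero `flips` but nonzero `conf_drops` match those said to contribute mainly to prediction confidence.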
Source journal: Knowledge-Based Systems (Engineering & Technology; Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Annual articles: 1245
Review time: 7.8 months
About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.