Jiquan Shan, Junxiao Wang, Lifeng Zhao, Liang Cai, Hongyuan Zhang, Ioannis Liritzis
Title: AnchorFormer: Differentiable anchor attention for efficient vision transformer
Journal: Pattern Recognition Letters, Volume 197, Pages 124-131
DOI: 10.1016/j.patrec.2025.07.016
Published: 2025-07-29
URL: https://www.sciencedirect.com/science/article/pii/S0167865525002673
Citations: 0
Abstract
Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring global self-attention among the image patches. Given n patches, they incur quadratic complexity, O(n^2), so the time cost is high when the input image is split at a fine granularity. Meanwhile, the pivotal information is often gathered in a few regions of an input image, so some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs anchor tokens to learn the pivotal information and accelerate inference. Firstly, by estimating the bipartite attention between the anchors and the tokens, the complexity is reduced from O(n^2) to O(mn), where m is the number of anchors and m < n. Notably, by representing the anchors with the neurons of a neural layer, we can learn these anchors differentiably and approximate the global self-attention through a Markov process. This avoids the burden caused by non-differentiable operations and further speeds up the approximate attention. Moreover, we extend the proposed model to three downstream tasks: classification, detection, and segmentation. Extensive experiments show the effectiveness of AnchorFormer, e.g., achieving up to 9.0% higher accuracy or a 46.7% FLOPs reduction on ImageNet classification, and 81.3% higher mAP on COCO detection under comparable FLOPs, compared to current baselines.
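To make the complexity argument concrete, the sketch below illustrates the general bipartite anchor-attention idea described in the abstract: tokens attend to a small set of m anchors and the anchors attend back to the n tokens, so the cost scales as O(mn) rather than O(n^2), and the composition of the two row-stochastic maps plays the role of the Markov-style approximation of full attention. This is a minimal illustrative sketch, not the authors' exact formulation; the function name, the two-stage softmax factorization, and the toy shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchor_attention(q, k, v, anchors):
    """Bipartite anchor-attention sketch (illustrative, not the paper's exact method).

    q, k, v : (n, d) token queries / keys / values.
    anchors : (m, d) anchor vectors with m << n (learnable in AnchorFormer).

    Instead of the full n x n attention map, we form an (n, m) token->anchor map
    and an (m, n) anchor->token map; their product approximates full attention
    at O(m*n*d) cost instead of O(n^2*d).
    """
    d = q.shape[-1]
    token_to_anchor = softmax(q @ anchors.T / np.sqrt(d), axis=-1)   # (n, m), row-stochastic
    anchor_to_token = softmax(anchors @ k.T / np.sqrt(d), axis=-1)   # (m, n), row-stochastic
    anchor_values = anchor_to_token @ v      # aggregate values at the anchors: (m, d)
    return token_to_anchor @ anchor_values   # redistribute to the tokens: (n, d)

# Toy usage: n = 196 patch tokens, d = 64 channels, m = 16 anchors (all hypothetical sizes).
rng = np.random.default_rng(0)
n, d, m = 196, 64, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
anchors = rng.standard_normal((m, d))
print(anchor_attention(q, k, v, anchors).shape)  # (196, 64)
```

Since both factor matrices are row-stochastic, their product is again row-stochastic, which is the sense in which a two-step transition (tokens to anchors to tokens) can approximate the full self-attention map while keeping the per-layer cost linear in n for fixed m.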
About the journal:
Pattern Recognition Letters aims at the rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.