AnchorFormer: Differentiable anchor attention for efficient vision transformer

IF 3.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jiquan Shan , Junxiao Wang , Lifeng Zhao , Liang Cai , Hongyuan Zhang , Ioannis Liritzis
{"title":"可微分的锚注意力,高效的视觉变压器","authors":"Jiquan Shan ,&nbsp;Junxiao Wang ,&nbsp;Lifeng Zhao ,&nbsp;Liang Cai ,&nbsp;Hongyuan Zhang ,&nbsp;Ioannis Liritzis","doi":"10.1016/j.patrec.2025.07.016","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given <span><math><mi>n</mi></math></span> patches, they will have quadratic complexity such as <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> and the time cost is high when splitting the input image with a small granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (<strong>AnchorFormer</strong>), which employs the anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity will be reduced from <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mi>m</mi><mi>n</mi><mo>)</mo></mrow></mrow></math></span>, where <span><math><mi>m</mi></math></span> is an anchor number and <span><math><mrow><mi>m</mi><mo>&lt;</mo><mi>n</mi></mrow></math></span>. Notably, by representing the anchors with the neurons in a neural layer, we can differentiably learn these anchors and approximate global self-attention through the Markov process. It avoids the burden caused by non-differentiable operations and further speeds up the approximate attention. Moreover, we extend the proposed model to three downstream tasks including classification, detection, and segmentation. Extensive experiments show the effectiveness of AnchorFormer, e.g., achieving up to a <em><strong>9.0%</strong></em> higher accuracy or <em><strong>46.7%</strong></em> FLOPs reduction on ImageNet classification, <em><strong>81.3%</strong></em> higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"197 ","pages":"Pages 124-131"},"PeriodicalIF":3.3000,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"AnchorFormer: Differentiable anchor attention for efficient vision transformer\",\"authors\":\"Jiquan Shan ,&nbsp;Junxiao Wang ,&nbsp;Lifeng Zhao ,&nbsp;Liang Cai ,&nbsp;Hongyuan Zhang ,&nbsp;Ioannis Liritzis\",\"doi\":\"10.1016/j.patrec.2025.07.016\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given <span><math><mi>n</mi></math></span> patches, they will have quadratic complexity such as <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> and the time cost is high when splitting the input image with a small granularity. 
Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (<strong>AnchorFormer</strong>), which employs the anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity will be reduced from <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mi>m</mi><mi>n</mi><mo>)</mo></mrow></mrow></math></span>, where <span><math><mi>m</mi></math></span> is an anchor number and <span><math><mrow><mi>m</mi><mo>&lt;</mo><mi>n</mi></mrow></math></span>. Notably, by representing the anchors with the neurons in a neural layer, we can differentiably learn these anchors and approximate global self-attention through the Markov process. It avoids the burden caused by non-differentiable operations and further speeds up the approximate attention. Moreover, we extend the proposed model to three downstream tasks including classification, detection, and segmentation. Extensive experiments show the effectiveness of AnchorFormer, e.g., achieving up to a <em><strong>9.0%</strong></em> higher accuracy or <em><strong>46.7%</strong></em> FLOPs reduction on ImageNet classification, <em><strong>81.3%</strong></em> higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.</div></div>\",\"PeriodicalId\":54638,\"journal\":{\"name\":\"Pattern Recognition Letters\",\"volume\":\"197 \",\"pages\":\"Pages 124-131\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-07-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167865525002673\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865525002673","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by computing global self-attention among image patches. Given n patches, the complexity of this attention is quadratic, i.e., O(n²), so the time cost is high when the input image is split at a fine granularity. Meanwhile, the pivotal information is often concentrated in a few regions of the input image, so some tokens may not be helpful for downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs anchor tokens to learn the pivotal information and accelerate inference. First, by estimating the bipartite attention between the anchors and the tokens, the complexity is reduced from O(n²) to O(mn), where m is the number of anchors and m < n. Notably, by representing the anchors with the neurons of a neural layer, we can learn these anchors differentiably and approximate the global self-attention through a Markov process. This avoids the overhead of non-differentiable operations and further speeds up the approximate attention. Moreover, we extend the proposed model to three downstream tasks: classification, detection, and segmentation. Extensive experiments show the effectiveness of AnchorFormer, e.g., up to 9.0% higher accuracy or a 46.7% FLOPs reduction on ImageNet classification, and 81.3% higher mAP on COCO detection under comparable FLOPs, compared to current baselines.
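The core mechanism described in the abstract, bipartite token-to-anchor attention chained into an approximation of the full n × n self-attention, can be sketched as below. This is a minimal reading of the abstract only, not the authors' implementation: the class name AnchorAttention, the choice of learnable parameters as anchors, and all dimensions are assumptions for illustration.

```python
# Minimal sketch of anchor-based bipartite attention, assuming PyTorch.
# Anchors are learnable vectors ("neurons"), so they are trained by
# backpropagation rather than selected by a non-differentiable sampling step.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnchorAttention(nn.Module):
    """Approximates n x n self-attention with two n x m maps (m << n)."""

    def __init__(self, dim: int, num_anchors: int = 16):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Anchors as learnable neurons in a layer (differentiable).
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        a = self.anchors                                     # (m, dim)
        # Token -> anchor attention: (batch, n, m), cost O(n * m).
        t2a = F.softmax(q @ a.transpose(0, 1) * self.scale, dim=-1)
        # Anchor -> token attention: (batch, m, n), cost O(m * n).
        a2t = F.softmax(a @ k.transpose(1, 2) * self.scale, dim=-1)
        # Chaining the two row-stochastic matrices acts like a two-step
        # Markov chain and approximates full n x n attention without ever
        # materializing it; the overall cost stays O(n * m * dim).
        return t2a @ (a2t @ v)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 64)          # e.g., 14x14 patches, dim 64
    attn = AnchorAttention(dim=64, num_anchors=16)
    print(attn(tokens).shape)                 # torch.Size([2, 196, 64])
```

Under this sketch, the quadratic term never appears: with n = 196 tokens and m = 16 anchors, the two attention maps contain 2 × 196 × 16 entries instead of 196 × 196, which is where the claimed O(n²) to O(mn) reduction comes from.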
Source journal
Pattern Recognition Letters
Category: Engineering & Technology, Computer Science: Artificial Intelligence
CiteScore: 12.40
Self-citation rate: 5.90%
Articles published: 287
Review time: 9.1 months
Aims and scope: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.