Frequency-Assisted Local Attention in Lower Layers of Visual Transformers.

International Journal of Neural Systems, published 2025-04-01 (Epub 2025-02-28). DOI: 10.1142/S0129065725500157
Xin Zhou, Zeyu Jiang, Shihua Zhou, Zhaohui Ren, Yongchao Zhang, Tianzhuang Yu, Yulin Liu

Abstract

Since vision transformers excel at establishing global relationships between features, they play an important role in current vision tasks. However, the global attention mechanism restricts the capture of local features, making convolutional assistance necessary. This paper shows that, with a special initialization method, transformer-based models can attend to local information in the manner of convolutional kernels without using convolutional blocks. Therefore, this paper proposes a novel hybrid multi-scale model called the Frequency-Assisted Local Attention Transformer (FALAT). FALAT introduces a Frequency-Assisted Window-based Positional Self-Attention (FWPSA) module that limits the attention distance of query tokens, enabling the capture of local content in the early stages. Information from value tokens in the frequency domain enhances information diversity during the self-attention computation. Additionally, in the later stages, the traditional convolutional downsampling in the spatial-reduction attention module is replaced with a depth-wise separable convolution to handle long-distance content. Experimental results demonstrate that FALAT-S achieves 83.0% accuracy on ImageNet-1k with an input size of [Formula: see text] using 29.9M parameters and 5.6G FLOPs. This model outperforms Next-ViT-S by 0.9 AP^b / 0.8 AP^m with Mask R-CNN [Formula: see text] on COCO and surpasses the recent FastViT-SA36 by 3.1% mIoU with FPN on ADE20k.
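The two mechanisms the abstract describes — window-limited self-attention with frequency-enhanced value tokens, and depth-wise separable downsampling — can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names, the low-pass filter on the value spectrum, and the 1-D formulation of the depth-wise separable convolution are all hypothetical simplifications.

```python
import numpy as np

def frequency_assisted_window_attention(x, window, dk):
    """Sketch of FWPSA-style attention (hypothetical names/choices).

    x: (N, d) token sequence. Each query attends only to keys within
    `window` positions (local attention), and the values are enriched
    with frequency-domain information via an FFT over the token axis.
    """
    N, d = x.shape
    q, k, v = x, x, x  # identity projections, for simplicity

    # Frequency-assisted values: add back a low-pass-filtered copy of
    # the value tokens -- one plausible reading of "information from
    # value tokens in the frequency domain" (the filter is an assumption).
    spec = np.fft.fft(v, axis=0)
    spec[N // 4:] = 0                       # keep only low frequencies
    v = v + np.real(np.fft.ifft(spec, axis=0))

    # Window mask: query i may attend to key j only if |i - j| <= window.
    idx = np.arange(N)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window

    scores = q @ k.T / np.sqrt(dk)
    scores = np.where(mask, scores, -np.inf)  # block out-of-window keys
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

def depthwise_separable_downsample(x, dw_kernels, pw_weight, stride=2):
    """Depth-wise separable strided convolution (1-D sketch): each
    channel is filtered with its own kernel (depth-wise), then channels
    are mixed with a point-wise (1x1) projection.

    x: (N, C) tokens; dw_kernels: (C, K) per-channel filters;
    pw_weight: (C, C_out) point-wise mixing matrix.
    """
    N, C = x.shape
    K = dw_kernels.shape[1]
    out_len = (N - K) // stride + 1
    dw = np.zeros((out_len, C))
    for c in range(C):                      # depth-wise: per-channel filter
        for i in range(out_len):
            dw[i, c] = x[i * stride:i * stride + K, c] @ dw_kernels[c]
    return dw @ pw_weight                   # point-wise: channel mixing
```

Compared with dense global attention, the window mask bounds each query's receptive field in the early stages, while the depth-wise separable downsample replaces a standard strided convolution at a fraction of the parameter cost (per-channel filters plus a 1x1 mix instead of a full C x C x K kernel).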
