An Attention-Locating Algorithm for Eliminating Background Effects in Fine-Grained Visual Classification

Impact Factor 11.1 · CAS Tier 1 (Engineering & Technology) · JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC
Yueting Huang;Zhenzhe Hechen;Mingliang Zhou;Zhengguo Li;Sam Kwong
{"title":"An Attention-Locating Algorithm for Eliminating Background Effects in Fine-Grained Visual Classification","authors":"Yueting Huang;Zhenzhe Hechen;Mingliang Zhou;Zhengguo Li;Sam Kwong","doi":"10.1109/TCSVT.2025.3535818","DOIUrl":null,"url":null,"abstract":"Fine-grained visual classification (FGVC) is a challenging task characterized by interclass similarity and intraclass diversity and has broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) in FGVC tasks since the data specificity of the multihead self-attention (MSA) mechanism in ViT is beneficial for extracting discriminative feature representations. However, these works focus on integrating feature dependencies at a high level, which leads to the model being easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT contains a two-stage framework to identify crucial regions effectively within images and enhance features by strategically reusing parameters. Second, the ASM accurately locates important target regions via the natural scores of the MSA, extracting finer low-level features to offer more comprehensive information through position mapping. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms the other methods in terms of performance, confirming the effectiveness of our proposed methods. The source code is available at <uri>https://github.com/Yueting-Huang/FAL-ViT</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5993-6006"},"PeriodicalIF":11.1000,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10855837/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Fine-grained visual classification (FGVC) is a challenging task characterized by high interclass similarity and large intraclass diversity, and it has broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) for FGVC, since the data specificity of the multihead self-attention (MSA) mechanism in ViT is beneficial for extracting discriminative feature representations. However, these works focus on integrating feature dependencies at a high level, which leaves the model easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) with an attention selection module (ASM). First, FAL-ViT uses a two-stage framework to identify crucial regions within images effectively and enhances features by strategically reusing parameters. Second, the ASM accurately locates important target regions via the natural scores of the MSA and extracts finer low-level features through position mapping to offer more comprehensive information. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms other methods, confirming the effectiveness of the proposed approach. The source code is available at https://github.com/Yueting-Huang/FAL-ViT.
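
As a concrete illustration of the ASM described above, the sketch below shows one plausible way to locate salient patches from the natural attention scores of an MSA layer and map them back to 2-D grid positions. This is not the authors' implementation (their released code is at the GitHub link above); the head averaging, the top-k selection, the helper name locate_regions, and the 14x14 patch grid of a ViT-B/16 are illustrative assumptions.

import torch

def locate_regions(attn: torch.Tensor, grid: int = 14, k: int = 12):
    """Hypothetical helper: attn is [B, heads, 1+N, 1+N] attention from
    one MSA layer, with the CLS token at index 0. Returns the scores,
    flat indices, and (row, col) grid positions of the k most-attended
    patches per image."""
    # CLS-to-patch attention, averaged over heads: the "natural scores".
    cls_attn = attn.mean(dim=1)[:, 0, 1:]            # [B, N]
    scores, idx = cls_attn.topk(k, dim=-1)           # k most-attended patches
    # Position mapping: flat patch index -> 2-D location on the patch grid.
    rows, cols = idx // grid, idx % grid
    return scores, idx, torch.stack((rows, cols), dim=-1)

# Toy usage with random attention for a 196-patch (14x14) ViT.
attn = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
scores, idx, pos = locate_regions(attn)
print(idx.shape, pos.shape)  # torch.Size([2, 12]) torch.Size([2, 12, 2])

In a two-stage design such as the one the abstract describes, the selected positions would indicate which low-level regions to re-encode in the second stage with the reused backbone parameters.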
Source journal: IEEE Transactions on Circuits and Systems for Video Technology
- CiteScore: 13.80
- Self-citation rate: 27.40%
- Articles published: 660
- Review time: 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.