{"title":"Webly Supervised Fine-Grained Classification by Integrally Tackling Noises and Subtle Differences","authors":"Junjie Chen;Jiebin Yan;Yuming Fang;Li Niu","doi":"10.1109/TIP.2025.3562740","DOIUrl":null,"url":null,"abstract":"Webly-supervised fine-grained visual classification (WSL-FGVC) aims to learn similar sub-classes from cheap web images, which suffers from two major issues: label noises in web images and subtle differences among fine-grained classes. However, existing methods for WSL-FGVC only focus on suppressing noise at image-level, but neglect to mine cues at pixel-level to distinguish the subtle differences among fine-grained classes. In this paper, we propose a bag-level top-down attention framework, which could tackle label noises and mine subtle cues simultaneously and integrally. Specifically, our method first extracts high-level semantic information from a bag of images belonging to the same class, and then uses the bag-level information to mine discriminative regions in various scales of each image. Besides, we propose to derive attention weights from attention maps to weight the bag-level fusion for a robust supervision. We also propose an attention loss on self-bag attention and cross-bag attention to facilitate the learning of valid attention. Extensive experiments on four WSL-FGVC datasets, i.e., Web-Aircraft, Web-Bird, Web-Car, and WebiNat-5089, demonstrate the effectiveness of our method against the state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2641-2653"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10977734/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Webly-supervised fine-grained visual classification (WSL-FGVC) aims to learn similar sub-classes from cheap web images and suffers from two major issues: label noise in web images and subtle differences among fine-grained classes. Existing methods for WSL-FGVC focus only on suppressing noise at the image level and neglect mining pixel-level cues that distinguish the subtle differences among fine-grained classes. In this paper, we propose a bag-level top-down attention framework that tackles label noise and mines subtle cues simultaneously and integrally. Specifically, our method first extracts high-level semantic information from a bag of images belonging to the same class, and then uses this bag-level information to mine discriminative regions at multiple scales in each image. In addition, we derive attention weights from the attention maps to weight the bag-level fusion for robust supervision, and we introduce an attention loss on self-bag and cross-bag attention to facilitate the learning of valid attention. Extensive experiments on four WSL-FGVC datasets, i.e., Web-Aircraft, Web-Bird, Web-Car, and WebiNat-5089, demonstrate the effectiveness of our method against state-of-the-art methods.
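To make the bag-level top-down attention idea concrete, below is a minimal PyTorch sketch. It is an illustration under assumptions, not the authors' implementation: the module name, feature dimensions, pooling choices, and the way per-image weights are derived from attention maps are all hypothetical. It only shows the general flow the abstract describes: pool a bag of same-class image features into a bag-level semantic vector, use that vector as a top-down query to produce spatial attention maps per image, and derive weights from those maps to re-fuse the bag-level representation.

```python
# Minimal sketch of bag-level top-down attention (illustrative only; names,
# dimensions, and weighting scheme are assumptions, not the paper's method).
import torch
import torch.nn as nn


class BagTopDownAttention(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Projects the bag-level vector before correlating it with spatial features.
        self.query_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats):
        # feats: (B, N, C, H, W) -- a bag of N same-class images, each with a CxHxW feature map.
        B, N, C, H, W = feats.shape
        img_vecs = feats.mean(dim=(3, 4))            # (B, N, C) global image descriptors
        bag_vec = img_vecs.mean(dim=1)               # (B, C) first-pass bag-level semantics

        # Top-down attention: correlate the bag-level query with every spatial location.
        query = self.query_proj(bag_vec)             # (B, C)
        attn_logits = torch.einsum('bc,bnchw->bnhw', query, feats) / C ** 0.5
        attn_maps = torch.softmax(attn_logits.flatten(2), dim=-1).view(B, N, H, W)

        # Attention-weighted descriptors emphasize discriminative regions in each image.
        attended = torch.einsum('bnhw,bnchw->bnc', attn_maps, feats)

        # Derive per-image weights from attention-map peakiness to down-weight noisy
        # images, then re-fuse the bag-level representation with those weights.
        img_weights = torch.softmax(attn_maps.flatten(2).max(dim=-1).values, dim=1)  # (B, N)
        fused_bag = torch.einsum('bn,bnc->bc', img_weights, attended)
        return attn_maps, attended, fused_bag


# Usage: a bag of 4 web images per class, with 512x14x14 backbone feature maps.
model = BagTopDownAttention(feat_dim=512)
feats = torch.randn(2, 4, 512, 14, 14)
attn_maps, attended, fused_bag = model(feats)
print(attn_maps.shape, attended.shape, fused_bag.shape)  # (2,4,14,14) (2,4,512) (2,512)
```

In this sketch the same attention maps serve both roles described in the abstract: they localize subtle, class-relevant regions within each image, and their statistics provide per-image weights so that mislabeled or off-topic web images contribute less to the fused bag-level supervision.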