Fine-Grained Vision Categorization with Vision Transformer: A Survey
Yong Zhang, W. Chen, Ying Zang
2022 IEEE 8th International Conference on Computer and Communications (ICCC), published 2022-12-09
DOI: https://doi.org/10.1109/ICCC56324.2022.10065990
Abstract
Fine-grained visual classification (FGVC) aims to distinguish subcategories that differ only subtly, using image data with large intra-class variation and small inter-class variation. This makes it a very challenging task in computer vision. With the rapid development of deep learning, FGVC algorithms have moved from traditional strongly supervised learning, which relies on large amounts of manual annotation, to weakly supervised learning. Weakly supervised approaches include algorithms based on conventional deep convolutional neural networks (CNNs) and algorithms based on the vision transformer (ViT). In recent years, ViT has shown strong performance on FGVC and tends to outperform deep CNNs. This paper first introduces the purpose and characteristics of fine-grained image classification, then reviews the corresponding public datasets and traditional CNN-based algorithms, discusses the performance, advantages, and disadvantages of ViT-based algorithms, and finally summarizes these algorithms.