Pro-NeXt: An All-in-One Unified Model for General Fine-Grained Visual Recognition
Authors: Junde Wu; Jiayuan Zhu; Min Xu; Yueming Jin
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 10, pp. 9187-9200
Publication type: Journal Article
Published: 2025-07-16
DOI: 10.1109/TPAMI.2025.3584902
URL: https://ieeexplore.ieee.org/document/11081804/
Citation count: 0
Abstract
Unlike general visual classification (CLS) tasks, certain CLS problems are significantly more challenging because they involve recognizing professionally categorized or highly specialized images. Fine-Grained Visual Classification (FGVC) has emerged as a broad approach to this complexity. However, most existing methods have been evaluated on a limited set of homogeneous benchmarks, such as bird species or vehicle brands. Moreover, these approaches often train a separate model for each task, which restricts their generalizability. This paper proposes a scalable and explainable foundation model designed to tackle a wide range of FGVC tasks from a unified, generalizable perspective. We introduce a novel architecture named Pro-NeXt and show that it generalizes well across diverse professional fields previously considered disparate, such as fashion, medicine, and art. Our base-sized Pro-NeXt-B surpasses all preceding task-specific models across 12 distinct datasets in 5 diverse domains. Furthermore, Pro-NeXt scales well: increasing its depth and width, and with them its GFLOPs, consistently improves accuracy. Beyond scalability and adaptability, the intermediate features of Pro-NeXt achieve reliable object detection and segmentation performance without extra training, highlighting its explainability. We will release the code to promote further research in this area.