Mining fine-grained attributes for vision–semantics integration in few-shot learning
Juan Zhao, Lili Kong, Deshang Sun, Deng Xiong, Jiancheng Lv
Image and Vision Computing, Volume 163, Article 105739 (published 2025-09-18)
DOI: 10.1016/j.imavis.2025.105739
URL: https://www.sciencedirect.com/science/article/pii/S0262885625003270
Citations: 0
Abstract
Recent advancements in Few-Shot Learning (FSL) have been significantly driven by leveraging semantic descriptions to enhance feature discrimination and recognition performance. However, existing methods, such as SemFew, often rely on verbose or manually curated attributes and apply semantic guidance only to the support set, limiting their effectiveness in distinguishing fine-grained categories. Inspired by human visual perception, which emphasizes crucial features for accurate recognition, this study introduces concise, fine-grained semantic attributes to address these limitations. We propose a Visual Attribute Enhancement (VAE) mechanism that integrates enriched semantic information into visual features, enabling the model to highlight the most relevant visual attributes and better distinguish visually similar samples. This module enhances visual features by aligning them with semantic attribute embeddings through a cross-attention mechanism and optimizes this alignment using an attribute-based cross-entropy loss. Furthermore, to mitigate the performance degradation caused by methods that supply semantic information exclusively to the support set, we propose a Semantic Attribute Reconstruction (SAR) module. This module predicts and integrates semantic features for query samples, ensuring a balanced information distribution between the support and query sets. Specifically, SAR enhances query representations by aligning and reconstructing semantic and visual attributes through regression and optimal transport losses to ensure semantic–visual consistency. Experiments on five benchmark datasets, including both general datasets and more challenging fine-grained few-shot datasets, consistently demonstrate that our proposed method outperforms state-of-the-art methods in both 5-way 1-shot and 5-way 5-shot settings.
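To make the cross-attention alignment described in the abstract concrete, the following is a minimal PyTorch sketch of that idea: visual features attend over semantic attribute embeddings and are supervised with an attribute-based cross-entropy loss. This is not the authors' implementation; the module name, tensor dimensions, single-attribute labels, and residual fusion are illustrative assumptions, and the SAR regression/optimal-transport losses for query samples are omitted.

```python
# Minimal sketch (assumed names and shapes), not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAttributeEnhancement(nn.Module):
    """Hypothetical VAE-style module: enrich visual features with attribute semantics."""

    def __init__(self, visual_dim: int, semantic_dim: int, num_attributes: int, num_heads: int = 4):
        super().__init__()
        # Project semantic attribute embeddings into the visual feature space.
        self.sem_proj = nn.Linear(semantic_dim, visual_dim)
        # Cross-attention: visual features act as queries, attribute embeddings as keys/values.
        self.cross_attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        # Attribute head used for the attribute-based cross-entropy loss.
        self.attr_head = nn.Linear(visual_dim, num_attributes)

    def forward(self, visual_feats: torch.Tensor, attr_embeds: torch.Tensor):
        # visual_feats: (B, D_v), attr_embeds: (A, D_s)
        q = visual_feats.unsqueeze(1)                                            # (B, 1, D_v)
        kv = self.sem_proj(attr_embeds).unsqueeze(0).expand(q.size(0), -1, -1)   # (B, A, D_v)
        attended, _ = self.cross_attn(q, kv, kv)                                 # (B, 1, D_v)
        enhanced = visual_feats + attended.squeeze(1)                            # residual fusion
        attr_logits = self.attr_head(enhanced)                                   # (B, A)
        return enhanced, attr_logits


# Usage sketch with random tensors standing in for a visual backbone and a text encoder.
B, D_v, D_s, A = 8, 640, 512, 50
vae = VisualAttributeEnhancement(D_v, D_s, A)
visual_feats = torch.randn(B, D_v)          # backbone features for support/query images
attr_embeds = torch.randn(A, D_s)           # embeddings of fine-grained attribute phrases
attr_labels = torch.randint(0, A, (B,))     # assumed: one dominant attribute label per sample
enhanced, attr_logits = vae(visual_feats, attr_embeds)
attr_ce_loss = F.cross_entropy(attr_logits, attr_labels)
```

The residual fusion here is one plausible design choice: it keeps the original visual feature intact while adding attribute-conditioned context, so the attribute cross-entropy can sharpen the enhanced representation without discarding the backbone's output.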
About the journal:
Image and Vision Computing has as its primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.