{"title":"Multi-modal Few-shot Image Recognition with enhanced semantic and visual integration","authors":"Chunru Dong, Lizhen Wang, Feng Zhang, Qiang Hua","doi":"10.1016/j.imavis.2025.105490","DOIUrl":null,"url":null,"abstract":"<div><div>Few-Shot Learning (FSL) enables models to recognize new classes with only a few examples by leveraging knowledge from known classes. Although some methods incorporate class names as prior knowledge, effectively integrating visual and semantic information remains challenging. Additionally, conventional similarity measurement techniques often result in information loss, obscure distinctions between samples, and fail to capture intra-sample diversity. To address these issues, this paper presents a Multi-modal Few-shot Image Recognition (MFSIR) approach. We first introduce the Multi-Scale Interaction Module (MSIM), which facilitates multi-scale interactions between semantic and visual features, significantly enhancing the representation of visual features. We also propose the Hybrid Similarity Measurement Module (HSMM), which integrates information from multiple dimensions to evaluate the similarity between samples by dynamically adjusting the weights of various similarity measurement methods, thereby improving the accuracy and robustness of similarity assessments. Experimental results demonstrate that our approach significantly outperforms existing methods on four FSL benchmarks, with marked improvements in FSL accuracy under 1-shot and 5-shot scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105490"},"PeriodicalIF":4.2000,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000782","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Few-Shot Learning (FSL) enables models to recognize new classes with only a few examples by leveraging knowledge from known classes. Although some methods incorporate class names as prior knowledge, effectively integrating visual and semantic information remains challenging. Additionally, conventional similarity measurement techniques often result in information loss, obscure distinctions between samples, and fail to capture intra-sample diversity. To address these issues, this paper presents a Multi-modal Few-shot Image Recognition (MFSIR) approach. We first introduce the Multi-Scale Interaction Module (MSIM), which facilitates multi-scale interactions between semantic and visual features, significantly enhancing the representation of visual features. We also propose the Hybrid Similarity Measurement Module (HSMM), which integrates information from multiple dimensions to evaluate the similarity between samples by dynamically adjusting the weights of various similarity measurement methods, thereby improving the accuracy and robustness of similarity assessments. Experimental results demonstrate that our approach significantly outperforms existing methods on four FSL benchmarks, with marked improvements in FSL accuracy under 1-shot and 5-shot scenarios.
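To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: an MSIM-like block that gates visual feature maps with a class-name embedding at several pooled scales, and an HSMM-like head that mixes two similarity measures (cosine and negative Euclidean distance are illustrative choices) with learnable, softmax-normalized weights. All module names, dimensions, and design details here are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleInteraction(nn.Module):
    """Sketch of an MSIM-like block: fuse a semantic (class-name) embedding
    into a visual feature map at several spatial scales (assumed design)."""

    def __init__(self, vis_dim: int = 640, sem_dim: int = 300, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # one projection of the semantic vector per spatial scale (assumption)
        self.sem_proj = nn.ModuleList([nn.Linear(sem_dim, vis_dim) for _ in scales])

    def forward(self, vis_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) convolutional feature map
        # sem_feat: (B, sem_dim) class-name embedding, e.g. from a text encoder
        fused = []
        for proj, s in zip(self.sem_proj, self.scales):
            pooled = F.adaptive_avg_pool2d(vis_feat, s)            # (B, C, s, s)
            gate = torch.sigmoid(proj(sem_feat))[:, :, None, None]  # channel gate
            fused.append(F.interpolate(pooled * gate, size=vis_feat.shape[-2:]))
        # residual combination of the semantically gated multi-scale maps
        return vis_feat + torch.stack(fused, dim=0).mean(dim=0)


class HybridSimilarity(nn.Module):
    """Sketch of an HSMM-like head: weighted mix of several similarity
    measures, with the mixing weights learned (here via a softmax)."""

    def __init__(self, num_measures: int = 2):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_measures))

    def forward(self, query: torch.Tensor, proto: torch.Tensor) -> torch.Tensor:
        # query: (Q, D) pooled query embeddings, proto: (N, D) class prototypes
        cos = F.normalize(query, dim=-1) @ F.normalize(proto, dim=-1).T  # (Q, N)
        neg_euc = -torch.cdist(query, proto)                             # (Q, N)
        w = torch.softmax(self.logits, dim=0)   # dynamic weights over measures
        return w[0] * cos + w[1] * neg_euc      # higher score = more similar


# Illustrative usage on random tensors (shapes are assumptions):
if __name__ == "__main__":
    msim = MultiScaleInteraction()
    hsmm = HybridSimilarity()
    vis = torch.randn(5, 640, 10, 10)      # 5 support images, ResNet-like features
    sem = torch.randn(5, 300)              # matching class-name embeddings
    protos = msim(vis, sem).mean(dim=(2, 3))   # (5, 640) class prototypes
    queries = torch.randn(15, 640)             # 15 pooled query embeddings
    scores = hsmm(queries, protos)             # (15, 5) class scores
    print(scores.argmax(dim=1))
```

In this sketch the softmax over `self.logits` plays the role of the "dynamically adjusted weights" the abstract mentions; a faithful reproduction of MFSIR would follow the paper's actual interaction and weighting formulations.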
Journal Introduction
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.