Visual–Semantic Fuzzy Interaction Network for Zero-Shot Learning
Xuemeng Hui; Zhunga Liu; Jiaxiang Liu; Zuowei Zhang; Longfei Wang
IEEE Transactions on Artificial Intelligence, vol. 6, no. 5, pp. 1345–1359, 2025. DOI: 10.1109/TAI.2024.3524955. https://ieeexplore.ieee.org/document/10820830/
Citations: 0
Abstract
Zero-shot learning (ZSL) aims to recognize images of unseen classes using manually defined semantic knowledge that describes both seen and unseen classes. The key to ZSL lies in building the interaction between precise image data and fuzzy semantic knowledge; the fuzziness stems from the difficulty of quantifying human knowledge. However, existing ZSL methods ignore the inherent fuzziness of semantic knowledge and treat it as precise data when building the visual–semantic interaction, which hinders the transfer of semantic knowledge from seen classes to unseen classes. To solve this problem, we propose a visual–semantic fuzzy interaction network (VSFIN) for ZSL. VSFIN utilizes an effective encoder–decoder structure consisting of a semantic prototype encoder (SPE) and a visual feature decoder (VFD), which enable visual features to interact with semantic knowledge via cross-attention. To achieve visual–semantic fuzzy interaction in the SPE and VFD, we introduce the concept of the membership function from fuzzy set theory and design a membership loss function. This loss function allows a certain degree of imprecision in the visual–semantic interaction, enabling VSFIN to utilize the given semantic knowledge appropriately. Moreover, we draw on the rank-sum test and propose a distribution alignment loss to alleviate the bias towards seen classes. Extensive experiments on three widely used benchmarks demonstrate that VSFIN outperforms current state-of-the-art methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings.
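The abstract does not give implementation details, but the two mechanisms it names are concrete enough to sketch. Below is a minimal, illustrative PyTorch sketch (not the authors' implementation) of (1) cross-attention in which visual features attend to semantic prototypes, and (2) a membership-style loss that tolerates a degree of imprecision by mapping the prediction–target distance through a fuzzy membership function. All names, tensor shapes, and the choice of a Gaussian membership function are assumptions made for illustration.

```python
# Illustrative sketch only: the paper's actual SPE/VFD architecture and
# membership loss are not specified in the abstract.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Queries from one modality attend over keys/values from the other."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # queries: (B, Nq, D); context: (B, Nk, D)
        out, _ = self.attn(queries, context, context)
        return self.norm(queries + out)  # residual connection + normalization


def membership_loss(pred: torch.Tensor, target: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """Fuzzy-style loss (assumed form): a Gaussian membership function maps the
    squared prediction-target distance to a degree of membership in [0, 1],
    so small deviations (high membership) are penalized only mildly."""
    dist = (pred - target).pow(2).mean(dim=-1)        # per-sample squared error
    membership = torch.exp(-dist / (2 * sigma ** 2))  # degree of agreement in [0, 1]
    return (1.0 - membership).mean()                  # encourage high membership


# Toy usage: visual patch tokens attend to projected semantic attribute vectors.
B, Np, Nc, D = 2, 49, 10, 64
visual = torch.randn(B, Np, D)     # visual features (e.g., patch tokens)
semantics = torch.randn(B, Nc, D)  # semantic prototypes (e.g., attribute embeddings)
block = CrossAttentionBlock(D)
fused = block(visual, semantics)   # visual queries, semantic context
loss = membership_loss(fused.mean(dim=1), semantics.mean(dim=1))
print(loss.item())
```

Compared with a plain mean-squared error, which penalizes all deviations quadratically, a membership-based loss of this shape saturates for near matches, which is one plausible way to "allow a certain degree of imprecision" as the abstract describes.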