GNN-based primitive recombination for compositional zero-shot learning
Fuqin Deng, Caiyun Tang, Lanhui Fu, Wei Jin, Jiaming Zhong, Hongming Wang, Nannan Li
DOI: 10.1016/j.imavis.2025.105762
Journal: Image and Vision Computing, Volume 163, Article 105762 (JCR Q2, Computer Science, Artificial Intelligence; Impact Factor 4.2)
Publication date: 2025-10-08 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0262885625003506
Citations: 0
Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute–object combinations, the core challenge being the complex visual manifestations across compositions. We posit that the key to addressing this challenge lies in enabling models to simulate human recognition processes by decomposing and dynamically recombining primitives (attributes and objects). Existing methods merely concatenate primitives after extraction to form new combinations, without achieving deep integration between attributes and objects to create truly novel compositions. To address this issue, we propose the Graph Neural Network-based Primitive Recombination (GPR) framework. This framework introduces a Primitive Recombination Module (PRM) built on the Compositional Matching Module (CMM). Specifically, we first extract primitives and build independent attribute and object spaces based on the CLIP model, enabling more precise learning of primitive-level visual features and reducing information residuals. Additionally, we introduce a Virtual Composition Unit (VCU), which feeds the optimized primitive features as nodes into a GNN and models the complex interactions between attributes and objects through message propagation. The module performs mean pooling on the updated node features to obtain a recombined representation and fuses in the global visual information from the original image through residual connections, generating semantically rich virtual compositional features while preserving key visual cues. We conduct extensive experiments on three CZSL benchmark datasets to show that GPR achieves state-of-the-art or competitive performance in both closed-world and open-world settings.
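The VCU described above can be pictured as a tiny two-node graph: the attribute feature and the object feature exchange messages, the updated nodes are mean-pooled, and the result is fused with the global image feature through a residual connection. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the layer shapes, the message function, and all names are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class VirtualCompositionUnit(nn.Module):
    """Hypothetical sketch of the VCU from the abstract.

    Attribute and object features act as two graph nodes; each receives a
    message from the other, nodes are updated, mean-pooled, and fused with
    the global image feature via a residual connection. Layer sizes and the
    message/update functions are illustrative assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(dim, dim)        # shared message transform
        self.update = nn.Linear(2 * dim, dim)     # node update from [self, msg]

    def forward(self, attr_feat, obj_feat, global_feat):
        # Stack the two primitives as nodes: (batch, 2, dim)
        nodes = torch.stack([attr_feat, obj_feat], dim=1)
        # In a two-node graph, each node's only neighbor is the other node,
        # so flipping along the node axis routes each message to its target.
        msgs = self.message(nodes.flip(dims=[1]))
        # Update each node from its own state plus the incoming message.
        nodes = self.update(torch.cat([nodes, msgs], dim=-1))
        # Mean-pool the updated nodes into one recombined representation.
        recombined = nodes.mean(dim=1)
        # Residual fusion with the global visual feature of the image.
        return recombined + global_feat
```

Used with CLIP-sized features, this would look like `vcu = VirtualCompositionUnit(512)` followed by `virtual_comp = vcu(attr, obj, img_global)`, yielding one virtual compositional feature per image.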
Journal overview:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.