GNN-based primitive recombination for compositional zero-shot learning
Fuqin Deng, Caiyun Tang, Lanhui Fu, Wei Jin, Jiaming Zhong, Hongming Wang, Nannan Li
DOI: 10.1016/j.imavis.2025.105762
Journal: Image and Vision Computing, Volume 163, Article 105762 (JCR Q2, Computer Science, Artificial Intelligence; Impact Factor 4.2)
Publication date: 2025-10-08 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0262885625003506
Citations: 0
Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute–object combinations, the core challenge being the complex visual manifestations across compositions. We posit that the key to addressing this challenge lies in enabling models to simulate human recognition processes by decomposing and dynamically recombining primitives (attributes and objects). Existing methods merely concatenate primitives after extraction to form new combinations, without achieving deep integration between attributes and objects to create truly novel compositions. To address this issue, we propose the Graph Neural Network-based Primitive Recombination (GPR) framework. This framework introduces a Primitive Recombination Module (PRM) built on the Compositional Matching Module (CMM). Specifically, we first extract primitives and build independent attribute and object spaces based on the CLIP model, enabling more precise learning of primitive-level visual features and reducing information residuals. Additionally, we introduce a Virtual Composition Unit (VCU), which feeds the optimized primitive features as nodes into a GNN and models the complex interactions between attributes and objects through message propagation. The module performs mean pooling on the updated node features to obtain a recombined representation and fuses in the global visual information from the original image through residual connections, generating semantically rich virtual compositional features while preserving key visual cues. We conduct extensive experiments on three CZSL benchmark datasets to show that GPR achieves state-of-the-art or competitive performance in both closed-world and open-world settings.
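The VCU described above can be pictured as a tiny two-node graph: the attribute feature and the object feature exchange messages, the updated nodes are mean-pooled, and the result is fused with the global image feature through a residual connection. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the layer shapes, the message function, and all names are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class VirtualCompositionUnit(nn.Module):
    """Hypothetical sketch of the VCU from the abstract.

    Attribute and object features act as two graph nodes; each receives a
    message from the other, nodes are updated, mean-pooled, and fused with
    the global image feature via a residual connection. Layer sizes and the
    message/update functions are illustrative assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(dim, dim)        # shared message transform
        self.update = nn.Linear(2 * dim, dim)     # node update from [self, msg]

    def forward(self, attr_feat, obj_feat, global_feat):
        # Stack the two primitives as nodes: (batch, 2, dim)
        nodes = torch.stack([attr_feat, obj_feat], dim=1)
        # In a two-node graph, each node's only neighbor is the other node,
        # so flipping along the node axis routes each message to its target.
        msgs = self.message(nodes.flip(dims=[1]))
        # Update each node from its own state plus the incoming message.
        nodes = self.update(torch.cat([nodes, msgs], dim=-1))
        # Mean-pool the updated nodes into one recombined representation.
        recombined = nodes.mean(dim=1)
        # Residual fusion with the global visual feature of the image.
        return recombined + global_feat
```

Used with CLIP-sized features, this would look like `vcu = VirtualCompositionUnit(512)` followed by `virtual_comp = vcu(attr, obj, img_global)`, yielding one virtual compositional feature per image.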
Journal overview:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.