Yang Liu;Xinshuo Wang;Xinbo Gao;Jungong Han;Ling Shao
{"title":"合成零射击学习的多级上下文原型调制","authors":"Yang Liu;Xinshuo Wang;Xinbo Gao;Jungong Han;Ling Shao","doi":"10.1109/TIP.2025.3592560","DOIUrl":null,"url":null,"abstract":"Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by leveraging prior knowledge of known primitives. However, real-world visual features of attributes and objects are often entangled, causing distribution shifts between seen and unseen combinations. Existing methods often ignore intrinsic variations and interactions among primitives, leading to poor feature discrimination and biased predictions. To address these challenges, we propose Multi-level Contextual Prototype Modulation (MCPM), a transformer-based framework with a hierarchical structure that effectively integrates attributes and objects to generate richer visual embeddings. At the feature level, we apply contrastive learning to improve discriminability across compositional tasks. At the prototype level, a subclass-driven modulator captures fine-grained attribute-object interactions, enabling better adaptation to long-tail distributions. Additionally, we introduce a Minority Attribute Enhancement (MAE) strategy that synthesizes virtual samples by mixing attribute classes, further mitigating data imbalance. Experiments on four benchmark datasets (MIT-States, C-GQA, UT-Zappos, and VAW-CZSL) show that MCPM brings significant performance improvements, verifying its effectiveness in complex composition scenes.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4856-4868"},"PeriodicalIF":13.7000,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Level Contextual Prototype Modulation for Compositional Zero-Shot Learning\",\"authors\":\"Yang Liu;Xinshuo Wang;Xinbo Gao;Jungong Han;Ling Shao\",\"doi\":\"10.1109/TIP.2025.3592560\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by leveraging prior knowledge of known primitives. However, real-world visual features of attributes and objects are often entangled, causing distribution shifts between seen and unseen combinations. Existing methods often ignore intrinsic variations and interactions among primitives, leading to poor feature discrimination and biased predictions. To address these challenges, we propose Multi-level Contextual Prototype Modulation (MCPM), a transformer-based framework with a hierarchical structure that effectively integrates attributes and objects to generate richer visual embeddings. At the feature level, we apply contrastive learning to improve discriminability across compositional tasks. At the prototype level, a subclass-driven modulator captures fine-grained attribute-object interactions, enabling better adaptation to long-tail distributions. Additionally, we introduce a Minority Attribute Enhancement (MAE) strategy that synthesizes virtual samples by mixing attribute classes, further mitigating data imbalance. Experiments on four benchmark datasets (MIT-States, C-GQA, UT-Zappos, and VAW-CZSL) show that MCPM brings significant performance improvements, verifying its effectiveness in complex composition scenes.\",\"PeriodicalId\":94032,\"journal\":{\"name\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"volume\":\"34 \",\"pages\":\"4856-4868\"},\"PeriodicalIF\":13.7000,\"publicationDate\":\"2025-07-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11104968/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11104968/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multi-Level Contextual Prototype Modulation for Compositional Zero-Shot Learning
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by leveraging prior knowledge of known primitives. However, real-world visual features of attributes and objects are often entangled, causing distribution shifts between seen and unseen combinations. Existing methods often ignore intrinsic variations and interactions among primitives, leading to poor feature discrimination and biased predictions. To address these challenges, we propose Multi-level Contextual Prototype Modulation (MCPM), a transformer-based framework with a hierarchical structure that effectively integrates attributes and objects to generate richer visual embeddings. At the feature level, we apply contrastive learning to improve discriminability across compositional tasks. At the prototype level, a subclass-driven modulator captures fine-grained attribute-object interactions, enabling better adaptation to long-tail distributions. Additionally, we introduce a Minority Attribute Enhancement (MAE) strategy that synthesizes virtual samples by mixing attribute classes, further mitigating data imbalance. Experiments on four benchmark datasets (MIT-States, C-GQA, UT-Zappos, and VAW-CZSL) show that MCPM brings significant performance improvements, verifying its effectiveness in complex composition scenes.