Simple Primitives with Feasibility- and Contextuality-Dependence for Open-World Compositional Zero-shot Learning

IF 20.8 · CAS Region 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Zhe Liu, Yun Li, L. Yao, Xiaojun Chang, Wei Fang, Xiaojun Wu, Yi Yang
DOI: 10.48550/arXiv.2211.02895
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication date: 2022-11-05 (Journal Article)
Citations: 1

Abstract

The task of Open-World Compositional Zero-Shot Learning (OW-CZSL) is to recognize novel state-object compositions in images from the set of all possible compositions, where the novel compositions are absent during the training stage. The performance of conventional methods degrades significantly due to the large cardinality of possible compositions. Some recent works treat simple primitives (i.e., states and objects) as independent and predict them separately to reduce this cardinality. However, this ignores the heavy dependence between states, objects, and compositions. In this paper, we model the dependence via feasibility and contextuality. Feasibility-dependence refers to the unequal feasibility of compositions, e.g., hairy is more feasible with cat than with building in the real world. Contextuality-dependence represents the contextual variance in images, e.g., a cat shows diverse appearances when it is dry or wet. We design Semantic Attention (SA), driven by the visual similarity between simple primitives, to capture feasibility semantics and alleviate impossible predictions. We also propose a generative Knowledge Disentanglement (KD) module to disentangle images into unbiased representations, easing the contextual bias. Moreover, we compatibly complement the independent compositional probability model with the learned feasibility and contextuality. In the experiments, we demonstrate the superior or competitive performance of the resulting SA- and KD-guided Simple Primitives (SAD-SP) model on three benchmark datasets.
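The abstract's core idea, combining an independent compositional probability model with a learned feasibility weighting, can be sketched in a toy example. All numbers, label names, and the hand-set feasibility matrix below are illustrative assumptions, not the paper's actual learned model:

```python
import numpy as np

# Hypothetical toy setup: 3 states and 4 objects, giving 12 compositions.
state_logits = np.array([2.0, 0.5, -1.0])        # e.g. [wet, dry, hairy]
object_logits = np.array([1.5, -0.5, 0.0, 2.0])  # e.g. [cat, building, road, dog]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_state = softmax(state_logits)
p_object = softmax(object_logits)

# Independent compositional probability: the outer product of the two
# primitive distributions scores every state-object pair.
p_comp = np.outer(p_state, p_object)  # shape (3, 4), sums to 1

# A feasibility score (hand-set here; learned via Semantic Attention in the
# paper) down-weights implausible pairs such as "hairy building".
feasibility = np.ones((3, 4))
feasibility[2, 1] = 0.1  # hairy + building: low feasibility
scores = p_comp * feasibility

# Predicted composition = highest-scoring (state, object) pair.
pred_state, pred_object = np.unravel_index(scores.argmax(), scores.shape)
```

With these logits the highest-scoring pair is (wet, dog); the feasibility mask only changes the ranking when an otherwise high-probability pair is implausible.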
Source journal metrics: CiteScore 28.40 · Self-citation rate 3.00% · Articles per year 885 · Review time 8.5 months
Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.