Title: Interaction Is Worth More Explanations: Improving Human–Object Interaction Representation With Propositional Knowledge
Authors: Feng Yang; Yichao Cao; Xuanpeng Li; Weigong Zhang
DOI: 10.1109/TCDS.2024.3496566
Journal: IEEE Transactions on Cognitive and Developmental Systems, vol. 17, no. 3, pp. 631–643
Publication date: 2024-11-11 (Journal Article)
Impact factor: 5.0; JCR: Q1 (Computer Science, Artificial Intelligence); CAS Region: 3 (Computer Science)
URL: https://ieeexplore.ieee.org/document/10750402/
Citation count: 0
Abstract
Detecting human–object interactions (HOI) presents a formidable challenge, necessitating the discernment of intricate, high-level relationships between humans and objects. Recent studies have explored HOI vision-and-language modeling (HOI-VLM), which leverages linguistic information inspired by cross-modal techniques. Despite its promise, current methodologies are constrained by limited annotation vocabularies and suboptimal word embeddings, which hinder effective alignment with visual features and, consequently, the efficient transfer of linguistic knowledge. In this work, we propose a novel cross-modal framework that leverages external propositional knowledge, which harmonizes annotation text with a broader spectrum of world knowledge and enables a more explicit and unambiguous representation of complex semantic relationships. Additionally, a single HO pair often exhibits multiple interactions arising from symbiotic or distinctive relationships, while identical interactions occur across diverse HO pairs (e.g., “human ride bicycle” versus “human ride horse”); the challenge lies in understanding the subtle differences and similarities between interactions involving different objects or occurring in varied contexts. To this end, we propose the Jaccard contrast strategy, which simultaneously optimizes cross-modal representation consistency across HO pairs (especially where multiple interactions co-occur) and encompasses both vision-to-vision and vision-to-knowledge alignment objectives. The effectiveness of our proposed method is comprehensively validated through extensive experiments, showcasing its superiority in the field of HOI analysis.
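The abstract does not spell out the Jaccard contrast strategy, but the core idea it names can be sketched: measure how much two HO pairs' interaction-label sets overlap (Jaccard similarity) and use that overlap as a soft target for how close their representations should be. The sketch below is a hypothetical, minimal illustration under that assumption; the function names (`jaccard_contrast_loss`), the cosine/temperature formulation, and the log-sigmoid pairwise objective are illustrative choices, not the paper's actual formulation.

```python
import math

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two interaction-label sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity between two feature vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def jaccard_contrast_loss(features, label_sets, temperature=0.1):
    """Hypothetical Jaccard-weighted pairwise contrast: HO pairs whose
    interaction-label sets overlap heavily are pulled together, disjoint
    pairs are pushed apart. Not the paper's exact objective."""
    n = len(features)
    loss, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = jaccard(label_sets[i], label_sets[j])           # soft target in [0, 1]
            s = cosine(features[i], features[j]) / temperature  # scaled similarity
            # Binary-cross-entropy-style term on the pair:
            # high Jaccard -> reward high similarity, low Jaccard -> penalize it.
            loss += w * math.log1p(math.exp(-s)) + (1.0 - w) * math.log1p(math.exp(s))
            count += 1
    return loss / count

# Toy usage: three HO-pair features with overlapping interaction labels.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [{"ride"}, {"ride", "hold"}, {"eat"}]
loss = jaccard_contrast_loss(feats, labels)
```

Under this reading, the same machinery covers both alignment objectives named in the abstract: vision-to-vision (both inputs are visual HO-pair features, as above) and vision-to-knowledge (one input is the propositional-knowledge embedding for the interaction).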
Journal introduction:
The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.