Synergistic Prompting Learning for Human-Object Interaction Detection.

IF 13.7 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing Pub Date : 2025-09-16 DOI:10.1109/tip.2025.3607614

Jinguo Luo,Weihong Ren,Zhiyong Wang,Xi'ai Chen,Huijie Fan,Zhi Han,Honghai Liu

{"title":"Synergistic Prompting Learning for Human-Object Interaction Detection.","authors":"Jinguo Luo,Weihong Ren,Zhiyong Wang,Xi'ai Chen,Huijie Fan,Zhi Han,Honghai Liu","doi":"10.1109/tip.2025.3607614","DOIUrl":null,"url":null,"abstract":"Human-Object Interaction (HOI) detection, as a foundational task in human-centric understanding, aims to detect interactive triplets in real-world scenarios. To better distinguish diverse HOIs within an open-world context, current HOI detectors utilize pre-trained Visual-Language Models (VLMs) to extract prior knowledge through textual prompts (i.e., descriptive texts for each HOI instance). However, relying on predetermined descriptive texts, such approaches only acquire a fixed set of textual knowledge for HOI prediction, consequently resulting in inferior performance and limited generalization. To remedy this, we propose a novel VLM-based method, which jointly performs prompting learning from both visual and textual perspectives and synergizes visual-textual prompting for HOI detection. Initially, we design a hierarchical adaptation architecture to perform progressive prompting: visual prompting is facilitated through gradual token migration from VLM's image encoder, while textual prompting is initialized with progressively leveled interaction descriptions. In addition, to synergize the visual-textual prompting learning, a text-supervising and image-tuning loop is introduced, in which the text-supervising stage guides visual prompting learning through contrastive learning and the image-tuning stage refines textual prompting by modal matching. Finally, we employ an interaction-aware knowledge merging mechanism to effectively transfer visual-textual knowledge encapsulated within synergistic prompting for HOI detection. Extensive experiments on two benchmarks demonstrate that our proposed method outperforms the state-of-the-art ones, under both supervised and zero-shot settings.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"35 1","pages":""},"PeriodicalIF":13.7000,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Image Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tip.2025.3607614","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Human-Object Interaction (HOI) detection, as a foundational task in human-centric understanding, aims to detect interactive triplets in real-world scenarios. To better distinguish diverse HOIs within an open-world context, current HOI detectors utilize pre-trained Visual-Language Models (VLMs) to extract prior knowledge through textual prompts (i.e., descriptive texts for each HOI instance). However, relying on predetermined descriptive texts, such approaches only acquire a fixed set of textual knowledge for HOI prediction, consequently resulting in inferior performance and limited generalization. To remedy this, we propose a novel VLM-based method, which jointly performs prompting learning from both visual and textual perspectives and synergizes visual-textual prompting for HOI detection. Initially, we design a hierarchical adaptation architecture to perform progressive prompting: visual prompting is facilitated through gradual token migration from VLM's image encoder, while textual prompting is initialized with progressively leveled interaction descriptions. In addition, to synergize the visual-textual prompting learning, a text-supervising and image-tuning loop is introduced, in which the text-supervising stage guides visual prompting learning through contrastive learning and the image-tuning stage refines textual prompting by modal matching. Finally, we employ an interaction-aware knowledge merging mechanism to effectively transfer visual-textual knowledge encapsulated within synergistic prompting for HOI detection. Extensive experiments on two benchmarks demonstrate that our proposed method outperforms the state-of-the-art ones, under both supervised and zero-shot settings.

查看原文本刊更多论文

人-物交互检测的协同提示学习。

人-物交互（HOI）检测是人类中心理解的一项基础任务，旨在检测现实场景中的交互三元组。为了在开放环境中更好地区分不同的HOI，当前的HOI检测器利用预训练的视觉语言模型（VLMs）通过文本提示（即每个HOI实例的描述性文本）提取先验知识。然而，这些方法依赖于预定的描述性文本，只能获得一组固定的文本知识来进行HOI预测，从而导致性能较差，泛化程度有限。为了解决这一问题，我们提出了一种新的基于vlm的方法，该方法从视觉和文本的角度共同执行提示学习，并将视觉-文本提示协同用于HOI检测。首先，我们设计了一个分层自适应架构来执行渐进式提示：视觉提示通过从VLM的图像编码器逐渐迁移令牌来实现，而文本提示则通过逐步分层的交互描述来初始化。此外，为了协同视觉-文本提示学习，引入了文本监督和图像调整循环，其中文本监督阶段通过对比学习指导视觉提示学习，图像调整阶段通过模态匹配精炼文本提示。最后，我们采用了一种交互感知的知识合并机制，有效地将封装在协同提示中的视觉文本知识转移到HOI检测中。在两个基准测试上的大量实验表明，我们提出的方法在监督和零射击设置下都优于最先进的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Image Processing 工程技术-工程：电子与电气

CiteScore

20.90

自引率

6.60%

发文量

774

审稿时长

7.6 months

期刊介绍： The IEEE Transactions on Image Processing delves into groundbreaking theories, algorithms, and structures concerning the generation, acquisition, manipulation, transmission, scrutiny, and presentation of images, video, and multidimensional signals across diverse applications. Topics span mathematical, statistical, and perceptual aspects, encompassing modeling, representation, formation, coding, filtering, enhancement, restoration, rendering, halftoning, search, and analysis of images, video, and multidimensional signals. Pertinent applications range from image and video communications to electronic imaging, biomedical imaging, image and video systems, and remote sensing.