Jinguo Luo, Weihong Ren, Zhiyong Wang, Xi'ai Chen, Huijie Fan, Zhi Han, Honghai Liu
Title: Synergistic Prompting Learning for Human-Object Interaction Detection
Journal: IEEE Transactions on Image Processing (JCR Q1, Computer Science, Artificial Intelligence; impact factor 13.7)
Publication type: Journal Article
Published: 2025-09-16
DOI: 10.1109/tip.2025.3607614 (https://doi.org/10.1109/tip.2025.3607614)
Citations: 0
Abstract
Human-Object Interaction (HOI) detection, a foundational task in human-centric understanding, aims to detect interactive triplets in real-world scenarios. To better distinguish diverse HOIs within an open-world context, current HOI detectors utilize pre-trained Visual-Language Models (VLMs) to extract prior knowledge through textual prompts (i.e., descriptive texts for each HOI instance). However, because they rely on predetermined descriptive texts, such approaches acquire only a fixed set of textual knowledge for HOI prediction, resulting in inferior performance and limited generalization. To remedy this, we propose a novel VLM-based method that jointly performs prompting learning from both visual and textual perspectives and synergizes visual-textual prompting for HOI detection. First, we design a hierarchical adaptation architecture to perform progressive prompting: visual prompting is facilitated through gradual token migration from the VLM's image encoder, while textual prompting is initialized with progressively leveled interaction descriptions. In addition, to synergize visual-textual prompting learning, we introduce a text-supervising and image-tuning loop, in which the text-supervising stage guides visual prompting learning through contrastive learning and the image-tuning stage refines textual prompting by modal matching. Finally, we employ an interaction-aware knowledge merging mechanism to effectively transfer the visual-textual knowledge encapsulated within synergistic prompting for HOI detection. Extensive experiments on two benchmarks demonstrate that our proposed method outperforms state-of-the-art methods under both supervised and zero-shot settings.
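The abstract says the text-supervising stage guides visual prompting through contrastive learning but does not give the loss. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE objective of the kind common in VLM training; it is not taken from the paper. The function name `contrastive_loss`, the temperature of 0.07, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical stand-in, not the paper's loss).

    visual[i] and textual[i] are assumed to be a matched pair; every other
    combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    n = len(v)
    # logits[i][j]: scaled cosine similarity of visual i and text j.
    logits = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Visual-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-visual direction: same computation over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)

# Toy embeddings: matched pairs give a low loss, swapped pairs a high one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a setup like the one the abstract describes, the textual side would presumably act as the supervising signal while the visual prompts are updated, but how the paper assigns gradients between the two stages is not stated here.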
About the journal:
The IEEE Transactions on Image Processing delves into groundbreaking theories, algorithms, and structures concerning the generation, acquisition, manipulation, transmission, scrutiny, and presentation of images, video, and multidimensional signals across diverse applications. Topics span mathematical, statistical, and perceptual aspects, encompassing modeling, representation, formation, coding, filtering, enhancement, restoration, rendering, halftoning, search, and analysis of images, video, and multidimensional signals. Pertinent applications range from image and video communications to electronic imaging, biomedical imaging, image and video systems, and remote sensing.