Exploring interaction concepts for human–object-interaction detection via global- and local-scale enhancing

Tianlun Luo, Qiao Yuan, Boxuan Zhu, Steven Guan, Rui Yang, Jeremy S. Smith, Eng Gee Lim

Neurocomputing, Volume 651, Article 130882. Published 2025-07-18. DOI: 10.1016/j.neucom.2025.130882
Citations: 0
Abstract
Understanding the interactions between human–object (HO) pairs is the key to the human–object interaction (HOI) detection task. Visual understanding research has been significantly influenced by recent advances in linguistic-visual contrastive learning. In HOI detection studies, linguistic and visual features usually must be aligned when linguistic knowledge is used for enhancement, which typically demands extra training data or extended training time. In this study, an effective approach is proposed for utilizing multimodal knowledge to enhance HOI learning at both the global and instance scales. Model performance on Rare HOI categories is notably improved by applying linguistic-knowledge-guided projection at the global scale and merging multimodal features at the instance scale. The proposed model achieves state-of-the-art performance on the HICO-Det benchmark, validating the effectiveness of the proposed global- and local-scale multimodal learning approach.
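The abstract does not detail the architecture, so the sketch below is only a minimal illustration of the two ideas it names: a global-scale projection of visual features into a frozen linguistic embedding space (avoiding extra alignment training), and an instance-scale merge of each HO-pair feature with matched linguistic features. All module names, dimensions, and the soft-matching fusion scheme are assumptions for illustration, not the authors' actual design; only the 117 verb classes of HICO-Det are taken from the benchmark itself.

```python
# Illustrative sketch only: assumes PyTorch and CLIP-style frozen verb
# embeddings; the actual model in the paper may differ substantially.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalHOIHead(nn.Module):
    def __init__(self, vis_dim: int = 256, txt_dim: int = 512, num_verbs: int = 117):
        super().__init__()
        # Global scale: project visual features into the space of frozen
        # linguistic verb embeddings, so no extra alignment training is needed.
        self.global_proj = nn.Linear(vis_dim, txt_dim)
        # Instance scale: merge each HO-pair feature with its matched
        # linguistic feature via a small fusion MLP.
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, txt_dim),
            nn.ReLU(),
            nn.Linear(txt_dim, txt_dim),
        )
        # Hypothetical frozen verb-name embeddings (in practice these would
        # come from a pretrained text encoder; random here for runnability).
        self.register_buffer(
            "verb_embed", F.normalize(torch.randn(num_verbs, txt_dim), dim=-1)
        )

    def forward(self, pair_feats: torch.Tensor) -> torch.Tensor:
        # pair_feats: (num_pairs, vis_dim) features for candidate HO pairs.
        g = F.normalize(self.global_proj(pair_feats), dim=-1)  # global projection
        sims = g @ self.verb_embed.t()                         # (num_pairs, num_verbs)
        # Soft-match each pair to linguistic knowledge, then merge per instance.
        matched_txt = sims.softmax(dim=-1) @ self.verb_embed   # (num_pairs, txt_dim)
        fused = self.fuse(torch.cat([pair_feats, matched_txt], dim=-1))
        return F.normalize(fused, dim=-1) @ self.verb_embed.t()  # verb logits


if __name__ == "__main__":
    head = GlobalLocalHOIHead()
    logits = head(torch.randn(8, 256))  # 8 candidate HO pairs
    print(logits.shape)                 # torch.Size([8, 117])
```

Because the verb embeddings stay frozen and classification reduces to cosine similarity against them, rare-category scores can benefit from linguistic structure shared across verbs, which is consistent with the Rare-category gains the abstract reports.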
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions to the field of neurocomputing. Its coverage spans neurocomputing theory, practice, and applications.