Exploring interaction concepts for human–object-interaction detection via global- and local-scale enhancing

Tianlun Luo, Qiao Yuan, Boxuan Zhu, Steven Guan, Rui Yang, Jeremy S. Smith, Eng Gee Lim

Neurocomputing, Volume 651, Article 130882. Published 2025-07-18. DOI: 10.1016/j.neucom.2025.130882
Citations: 0
Abstract
Understanding the interactions between human–object (HO) pairs is the key to the human–object interaction (HOI) detection task. Visual understanding research has been significantly influenced by recent advances in linguistic-visual contrastive learning. In HOI detection studies, linguistic and visual features usually must be aligned when linguistic knowledge is used for enhancement, which typically demands extra training data or extended training time. In this study, an effective approach is proposed for utilizing multimodal knowledge to enhance HOI learning at both the global and instance scales. Model performance on Rare HOI categories is notably improved by applying linguistic-knowledge-guided projection at the global scale and merging multimodal features at the instance scale. The proposed model achieves state-of-the-art performance on the HICO-Det benchmark, validating the effectiveness of the proposed global- and local-scale multimodal learning approach.
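The abstract does not detail the architecture, so the sketch below is only a minimal illustration of the two ideas it names: a global-scale projection of visual features into a frozen linguistic embedding space (avoiding extra alignment training), and an instance-scale merge of each HO-pair feature with matched linguistic features. All module names, dimensions, and the soft-matching fusion scheme are assumptions for illustration, not the authors' actual design; only the 117 verb classes of HICO-Det are taken from the benchmark itself.

```python
# Illustrative sketch only: assumes PyTorch and CLIP-style frozen verb
# embeddings; the actual model in the paper may differ substantially.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalHOIHead(nn.Module):
    def __init__(self, vis_dim: int = 256, txt_dim: int = 512, num_verbs: int = 117):
        super().__init__()
        # Global scale: project visual features into the space of frozen
        # linguistic verb embeddings, so no extra alignment training is needed.
        self.global_proj = nn.Linear(vis_dim, txt_dim)
        # Instance scale: merge each HO-pair feature with its matched
        # linguistic feature via a small fusion MLP.
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, txt_dim),
            nn.ReLU(),
            nn.Linear(txt_dim, txt_dim),
        )
        # Hypothetical frozen verb-name embeddings (in practice these would
        # come from a pretrained text encoder; random here for runnability).
        self.register_buffer(
            "verb_embed", F.normalize(torch.randn(num_verbs, txt_dim), dim=-1)
        )

    def forward(self, pair_feats: torch.Tensor) -> torch.Tensor:
        # pair_feats: (num_pairs, vis_dim) features for candidate HO pairs.
        g = F.normalize(self.global_proj(pair_feats), dim=-1)  # global projection
        sims = g @ self.verb_embed.t()                         # (num_pairs, num_verbs)
        # Soft-match each pair to linguistic knowledge, then merge per instance.
        matched_txt = sims.softmax(dim=-1) @ self.verb_embed   # (num_pairs, txt_dim)
        fused = self.fuse(torch.cat([pair_feats, matched_txt], dim=-1))
        return F.normalize(fused, dim=-1) @ self.verb_embed.t()  # verb logits


if __name__ == "__main__":
    head = GlobalLocalHOIHead()
    logits = head(torch.randn(8, 256))  # 8 candidate HO pairs
    print(logits.shape)                 # torch.Size([8, 117])
```

Because the verb embeddings stay frozen and classification reduces to cosine similarity against them, rare-category scores can benefit from linguistic structure shared across verbs, which is consistent with the Rare-category gains the abstract reports.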
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions to the field of neurocomputing. Its coverage spans neurocomputing theory, practice, and applications.