{"title":"Recognizing human–object interactions in videos with the supervision of natural language","authors":"Qiyue Li , Xuemei Xie , Jin Zhang , Guangming Shi","doi":"10.1016/j.neunet.2025.107606","DOIUrl":null,"url":null,"abstract":"<div><div>Existing models for recognizing human–object interaction (HOI) in videos mainly rely on visual information for reasoning and generally treat recognition tasks as traditional multi-classification problems, where labels are represented by numbers. This supervised learning method discards semantic information in the labels and ignores advanced semantic relationships between actual categories. In fact, natural language contains a wealth of linguistic knowledge that humans have distilled about human–object interaction, and the category text contains a large amount of semantic relationships between texts. Therefore, this paper introduces human–object interaction category text features as labels and proposes a natural language supervised learning model for human–object interaction by using natural language to supervise visual feature learning to enhance visual feature expression capability. The model applies contrastive learning paradigm to human–object interaction recognition, using an image–text paired pre-training model to obtain individual image features and interaction category text features, and then using a spatial–temporal mixed module to obtain high semantic combination-based human–object interaction spatial–temporal features. Finally, the obtained visual interaction features and category text features are compared for similarity to infer the correct video human–object interaction category. The model aims to explore the semantic information in human–object interaction category label text and use a large number of image–text paired samples trained by a multi-modal pre-training model to obtain visual and textual correspondence to enhance the ability of video human–object interaction recognition. Experimental results on two human–object interaction datasets demonstrate that our method achieves the state-of-the-art performance, e.g., 93.6% and 93.1% F1 Score for Sub-activity and Affordance on CAD-120 dataset.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107606"},"PeriodicalIF":6.0000,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025004861","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Existing models for recognizing human–object interactions (HOI) in videos rely mainly on visual information for reasoning and generally treat recognition as a traditional multi-class classification problem in which labels are represented by numbers. This form of supervision discards the semantic information carried by the labels and ignores the higher-level semantic relationships between the actual categories. Natural language, in contrast, encodes a wealth of knowledge that humans have distilled about human–object interaction, and category text carries rich semantic relationships between categories. This paper therefore introduces HOI category text features as labels and proposes a natural-language-supervised learning model for HOI recognition, using natural language to supervise visual feature learning and thereby strengthen the expressive capability of visual features. The model applies a contrastive learning paradigm to HOI recognition: an image–text pre-trained model extracts individual image features and interaction-category text features, and a spatial–temporal mixing module then composes them into high-level, semantically combined spatial–temporal HOI features. Finally, the resulting visual interaction features are compared with the category text features by similarity to infer the correct HOI category for the video. The model aims to exploit the semantic information in HOI category label text and to leverage the visual–textual correspondence learned by a multi-modal model pre-trained on a large number of image–text pairs, enhancing video HOI recognition. Experimental results on two HOI datasets demonstrate that our method achieves state-of-the-art performance, e.g., 93.6% and 93.1% F1 score for sub-activity and affordance recognition on the CAD-120 dataset.
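To make the inference step concrete, the sketch below illustrates the similarity-based recognition the abstract describes: per-frame visual features are pooled over time into a clip-level interaction feature and matched against category text embeddings by cosine similarity. This is a minimal sketch under stated assumptions, not the paper's implementation; the module name `SpatialTemporalMixer`, the 512-dimensional feature size, and the temperature value are illustrative, and in practice the frame and text features would come from an image–text pre-trained encoder such as CLIP rather than random tensors.

```python
# Minimal sketch of similarity-based HOI inference (illustrative assumptions,
# not the paper's actual architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalMixer(nn.Module):
    """Toy stand-in for a spatial-temporal mixing module: pools per-frame
    visual features over time into one clip-level interaction feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, dim) per-frame visual features
        _, h = self.temporal(frame_feats)   # h: (num_layers, batch, dim)
        return self.proj(h[-1])             # clip-level feature: (batch, dim)

def classify_by_similarity(video_feat, text_feats, temperature=0.07):
    """Pick the HOI category whose text embedding is most similar to the
    video interaction feature (contrastive-style inference)."""
    v = F.normalize(video_feat, dim=-1)     # (batch, dim)
    t = F.normalize(text_feats, dim=-1)     # (num_classes, dim)
    logits = v @ t.T / temperature          # scaled cosine similarities
    return logits.argmax(dim=-1)            # predicted class index per clip

# Usage with random stand-in features; real features would come from a
# pre-trained image-text encoder applied to frames and category prompts.
mixer = SpatialTemporalMixer(dim=512)
frames = torch.randn(2, 16, 512)            # 2 clips, 16 frames each
class_text = torch.randn(10, 512)           # embeddings of 10 category prompts
pred = classify_by_similarity(mixer(frames), class_text)
print(pred.shape)                           # torch.Size([2])
```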
Journal Introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.