Recognizing human–object interactions in videos with the supervision of natural language

IF 6.0 · CAS Region 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Qiyue Li , Xuemei Xie , Jin Zhang , Guangming Shi
Journal: Neural Networks, Vol. 190, Article 107606
DOI: 10.1016/j.neunet.2025.107606
Published: 2025-05-27 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0893608025004861
Citations: 0

Abstract

Existing models for recognizing human–object interactions (HOI) in videos rely mainly on visual information for reasoning and generally treat recognition as a conventional multi-class classification problem in which labels are represented by numbers. This form of supervision discards the semantic information carried by the labels and ignores higher-level semantic relationships between categories. In fact, natural language encodes a wealth of human knowledge about human–object interaction, and category texts carry rich semantic relationships among themselves. This paper therefore introduces HOI category text features as labels and proposes a natural-language-supervised learning model for HOI recognition, using natural language to supervise visual feature learning and thereby strengthen visual feature representation. The model applies the contrastive learning paradigm to HOI recognition: an image–text paired pre-training model extracts individual image features and interaction-category text features, and a spatial–temporal mixing module then produces high-level, combination-based spatial–temporal HOI features. Finally, the visual interaction features are compared with the category text features by similarity to infer the correct HOI category for the video. The model aims to exploit the semantic information in HOI category label text and to leverage the visual–textual correspondence learned by a multi-modal pre-training model from large numbers of image–text pairs to improve video HOI recognition. Experimental results on two HOI datasets demonstrate that our method achieves state-of-the-art performance, e.g., 93.6% and 93.1% F1 scores for Sub-activity and Affordance on the CAD-120 dataset.
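The inference step described in the abstract (comparing a video's interaction feature against category text embeddings by similarity) can be sketched as follows. This is not the authors' implementation; it is a minimal CLIP-style illustration with made-up feature vectors, where `classify_by_similarity`, the feature dimensions, and the temperature value are all hypothetical:

```python
import numpy as np

def classify_by_similarity(video_feat, text_feats, temperature=0.07):
    """Pick the HOI category whose text embedding has the highest
    cosine similarity to the aggregated video feature (CLIP-style)."""
    # L2-normalize so the dot product equals cosine similarity
    v = video_feat / np.linalg.norm(video_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    # temperature-scaled logits, then a softmax over categories
    logits = t @ v / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy example: 3 hypothetical category embeddings in a 4-d feature space.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(3, 4))
# A video feature close to category 1's text embedding (small noise added).
video_feat = text_feats[1] + 0.1 * rng.normal(size=4)
pred, probs = classify_by_similarity(video_feat, text_feats)
```

In the paper's pipeline the video feature would come from the spatial–temporal mixing module and the text features from the pre-trained text encoder; here both are random stand-ins to show only the similarity-matching logic.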
Source journal: Neural Networks (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.90
Self-citation rate: 7.70%
Articles per year: 425
Review time: 67 days
About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.