{"title":"Recognizing human–object interactions in videos with the supervision of natural language","authors":"Qiyue Li , Xuemei Xie , Jin Zhang , Guangming Shi","doi":"10.1016/j.neunet.2025.107606","DOIUrl":null,"url":null,"abstract":"<div><div>Existing models for recognizing human–object interaction (HOI) in videos mainly rely on visual information for reasoning and generally treat recognition tasks as traditional multi-classification problems, where labels are represented by numbers. This supervised learning method discards semantic information in the labels and ignores advanced semantic relationships between actual categories. In fact, natural language contains a wealth of linguistic knowledge that humans have distilled about human–object interaction, and the category text contains a large amount of semantic relationships between texts. Therefore, this paper introduces human–object interaction category text features as labels and proposes a natural language supervised learning model for human–object interaction by using natural language to supervise visual feature learning to enhance visual feature expression capability. The model applies contrastive learning paradigm to human–object interaction recognition, using an image–text paired pre-training model to obtain individual image features and interaction category text features, and then using a spatial–temporal mixed module to obtain high semantic combination-based human–object interaction spatial–temporal features. Finally, the obtained visual interaction features and category text features are compared for similarity to infer the correct video human–object interaction category. The model aims to explore the semantic information in human–object interaction category label text and use a large number of image–text paired samples trained by a multi-modal pre-training model to obtain visual and textual correspondence to enhance the ability of video human–object interaction recognition. Experimental results on two human–object interaction datasets demonstrate that our method achieves the state-of-the-art performance, e.g., 93.6% and 93.1% F1 Score for Sub-activity and Affordance on CAD-120 dataset.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107606"},"PeriodicalIF":6.0000,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025004861","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Existing models for recognizing human–object interactions (HOI) in videos rely mainly on visual information for reasoning and generally treat recognition as a traditional multi-class classification problem in which labels are represented by numbers. This form of supervision discards the semantic information carried by the labels and ignores the higher-level semantic relationships between the actual categories. Natural language, in contrast, encodes a wealth of knowledge that humans have distilled about human–object interaction, and category text carries rich semantic relationships between categories. This paper therefore introduces HOI category text features as labels and proposes a natural-language-supervised learning model for HOI recognition, using natural language to supervise visual feature learning and thereby strengthen the expressive capability of visual features. The model applies a contrastive learning paradigm to HOI recognition: an image–text pre-trained model extracts individual image features and interaction-category text features, and a spatial–temporal mixing module then composes them into high-level, semantically combined spatial–temporal HOI features. Finally, the resulting visual interaction features are compared with the category text features by similarity to infer the correct HOI category for the video. The model aims to exploit the semantic information in HOI category label text and to leverage the visual–textual correspondence learned by a multi-modal model pre-trained on a large number of image–text pairs, enhancing video HOI recognition. Experimental results on two HOI datasets demonstrate that our method achieves state-of-the-art performance, e.g., 93.6% and 93.1% F1 score for sub-activity and affordance recognition on the CAD-120 dataset.
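To make the inference step concrete, the sketch below illustrates the similarity-based recognition the abstract describes: per-frame visual features are pooled over time into a clip-level interaction feature and matched against category text embeddings by cosine similarity. This is a minimal sketch under stated assumptions, not the paper's implementation; the module name `SpatialTemporalMixer`, the 512-dimensional feature size, and the temperature value are illustrative, and in practice the frame and text features would come from an image–text pre-trained encoder such as CLIP rather than random tensors.

```python
# Minimal sketch of similarity-based HOI inference (illustrative assumptions,
# not the paper's actual architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalMixer(nn.Module):
    """Toy stand-in for a spatial-temporal mixing module: pools per-frame
    visual features over time into one clip-level interaction feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, dim) per-frame visual features
        _, h = self.temporal(frame_feats)   # h: (num_layers, batch, dim)
        return self.proj(h[-1])             # clip-level feature: (batch, dim)

def classify_by_similarity(video_feat, text_feats, temperature=0.07):
    """Pick the HOI category whose text embedding is most similar to the
    video interaction feature (contrastive-style inference)."""
    v = F.normalize(video_feat, dim=-1)     # (batch, dim)
    t = F.normalize(text_feats, dim=-1)     # (num_classes, dim)
    logits = v @ t.T / temperature          # scaled cosine similarities
    return logits.argmax(dim=-1)            # predicted class index per clip

# Usage with random stand-in features; real features would come from a
# pre-trained image-text encoder applied to frames and category prompts.
mixer = SpatialTemporalMixer(dim=512)
frames = torch.randn(2, 16, 512)            # 2 clips, 16 frames each
class_text = torch.randn(10, 512)           # embeddings of 10 category prompts
pred = classify_by_similarity(mixer(frames), class_text)
print(pred.shape)                           # torch.Size([2])
```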
Journal Introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.