{"title":"VHOIP: Video-based Human–Object Interaction recognition with CLIP Prior knowledge","authors":"Doyeol Baek, Junsuk Choe","doi":"10.1016/j.patrec.2025.02.014","DOIUrl":null,"url":null,"abstract":"<div><div>In this paper, we introduce a novel approach to recognizing Human–Object Interactions (HOI) in videos, a task crucial for understanding videos centered on human activities. Traditional methods often fall short of accurately identifying subtle interactions, particularly in dynamic sequences involving multiple individuals and objects. To address these issues, we leverage CLIP (Contrastive Language–Image Pre-training), a model renowned for its rich visual and linguistic knowledge. Our method, Video-based HOI recognition with CLIP Prior knowledge (VHOIP), merges the spatial and temporal analysis capabilities of a video-based HOI framework with the detailed interaction understanding of CLIP. This combination significantly improves HOI recognition performance. Through rigorous validation on three different HOI recognition datasets, our method demonstrates remarkable improvements over current state-of-the-art techniques, both qualitatively and quantitatively, indicating the effectiveness of our approach.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"190 ","pages":"Pages 133-140"},"PeriodicalIF":3.9000,"publicationDate":"2025-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016786552500056X","RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In this paper, we introduce a novel approach to recognizing Human–Object Interactions (HOI) in videos, a task crucial for understanding videos centered on human activities. Traditional methods often fall short of accurately identifying subtle interactions, particularly in dynamic sequences involving multiple individuals and objects. To address these issues, we leverage CLIP (Contrastive Language–Image Pre-training), a model renowned for its rich visual and linguistic knowledge. Our method, Video-based HOI recognition with CLIP Prior knowledge (VHOIP), merges the spatial and temporal analysis capabilities of a video-based HOI framework with the detailed interaction understanding of CLIP. This combination significantly improves HOI recognition performance. Through rigorous validation on three different HOI recognition datasets, our method demonstrates remarkable improvements over current state-of-the-art techniques, both qualitatively and quantitatively, indicating the effectiveness of our approach.
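The abstract describes matching video-level visual features against CLIP's language-side knowledge of interactions. As a rough illustration only (the paper's actual VHOIP architecture is not given here), the sketch below shows the generic CLIP-style matching pattern: mean-pool per-frame embeddings over time, then score candidate HOI labels by cosine similarity. The embedding dimension, pooling choice, and random placeholder vectors (standing in for real CLIP image/text embeddings) are all assumptions for illustration, not the authors' method.

```python
import numpy as np

# Hypothetical sketch of CLIP-style video-to-label matching, NOT the VHOIP
# architecture itself. Random vectors stand in for real CLIP embeddings.

rng = np.random.default_rng(0)
DIM = 512  # assumed embedding size (typical for CLIP ViT-B variants)

def l2_normalize(x, axis=-1):
    """Unit-normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def score_hoi(frame_embeddings, label_embeddings):
    """Mean-pool frame embeddings over time, then softmax over cosine
    similarities with candidate HOI label embeddings."""
    video_emb = l2_normalize(frame_embeddings.mean(axis=0))
    labels = l2_normalize(label_embeddings, axis=1)
    logits = labels @ video_emb              # cosine similarity per label
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

# 8 sampled frames; 3 candidate HOI labels, e.g. text embeddings of
# "a person riding a bicycle", "a person holding a cup", ...
frames = rng.standard_normal((8, DIM))
labels = rng.standard_normal((3, DIM))
probs = score_hoi(frames, labels)
print(probs)
```

In a real pipeline the frame and label vectors would come from CLIP's image and text encoders, and the paper's contribution lies in how the video-based HOI framework and the CLIP prior are fused, which this sketch does not attempt to reproduce.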
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.