ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Proceedings of the 2021 ACM Workshop on Intelligent Cross-Data Analysis and Retrieval. Pub Date: 2021-05-25. DOI: 10.1145/3463944.3469097

Meng-Jiun Chiou, Chun-Yu Liao, Li-Wei Wang, Roger Zimmermann, Jiashi Feng

Citations: 13

Abstract

Detecting human-object interactions (HOIs) is an important step toward comprehensive visual understanding by machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, even humans are unlikely to guess temporal-related HOIs (e.g., opening or closing a door) correctly from a single video frame; the neighboring frames play an essential role. Nevertheless, conventional HOI methods that operate only on static images have been used to predict temporal-related interactions, which amounts to guessing without temporal context and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs because of a feature-inconsistency issue. We then propose a simple yet effective architecture, Spatial-Temporal HOI Detection (ST-HOI), which uses temporal information such as human and object trajectories, correctly localized visual features, and spatial-temporal masking pose features. We also construct a new video HOI benchmark, VidHOI, on which the proposed approach serves as a solid baseline.
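The abstract does not spell out the feature-inconsistency issue, so the following is only a toy sketch of one plausible reading: if every frame of a clip is pooled with the box from a single keyframe, a moving object drifts out of that box, whereas pooling each frame with its own trajectory box stays on the object. The function names (`make_clip`, `pool_box`, `naive_pool`, `trajectory_pool`), the synthetic clip, and the mean-pooling operator are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def make_clip(num_frames=4, size=16, obj=4, step=3):
    """Hypothetical toy clip: an obj x obj block of ones moving right by `step` pixels per frame."""
    clip = np.zeros((num_frames, size, size), dtype=np.float32)
    boxes = []
    for t in range(num_frames):
        x0 = t * step
        clip[t, 2:2 + obj, x0:x0 + obj] = 1.0
        # Box format (x0, y0, x1, y1) with exclusive upper bounds.
        boxes.append((x0, 2, x0 + obj, 2 + obj))
    return clip, boxes

def pool_box(frame, box):
    """Mean-pool one frame inside one box (a stand-in for RoI pooling)."""
    x0, y0, x1, y1 = box
    return float(frame[y0:y1, x0:x1].mean())

def naive_pool(clip, key_box):
    """Pool every frame with the same keyframe box, then average over time."""
    return float(np.mean([pool_box(frame, key_box) for frame in clip]))

def trajectory_pool(clip, boxes):
    """Pool each frame with its own box along the trajectory, then average over time."""
    return float(np.mean([pool_box(frame, box) for frame, box in zip(clip, boxes)]))

clip, boxes = make_clip()
naive = naive_pool(clip, boxes[0])   # keyframe box reused for all frames
traj = trajectory_pool(clip, boxes)  # per-frame boxes follow the object
print(f"naive={naive:.4f} trajectory={traj:.4f}")
# prints: naive=0.3125 trajectory=1.0000
```

In this toy setting the keyframe box covers the object fully in the first frame, only one column of it in the second, and none of it afterwards, so the pooled feature is mostly background. Following the trajectory keeps the pooled region aligned with the object in every frame.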
