A New Dataset and Approach for Timestamp Supervised Action Segmentation Using Human Object Interaction

S. Sayed, Reza Ghoddoosian, Bhaskar Trivedi, V. Athitsos
DOI: 10.1109/CVPRW59228.2023.00315
Published in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2023
Citations: 3

Abstract

This paper focuses on leveraging Human Object Interaction (HOI) information to improve temporal action segmentation under timestamp supervision, where only one frame is annotated for each action segment. This information is obtained from an off-the-shelf pre-trained HOI detector, which requires no additional HOI-related annotations on our experimental datasets. Our approach generates pseudo labels by expanding the annotated timestamps into intervals, allowing the system to exploit the spatio-temporal continuity of human interaction with an object to segment the video. We also propose the (3+1) Real-time Cooking (ReC) dataset, a realistic collection of videos of 30 participants cooking 15 breakfast items. Our dataset has three main properties: 1) to our knowledge, it is the first to offer synchronized third- and first-person videos; 2) it incorporates diverse actions and tasks; and 3) it consists of high-resolution frames that support detection of fine-grained information. In our experiments, we benchmark state-of-the-art segmentation methods under different levels of supervision on our dataset. We also quantitatively show the advantages of using HOI information: our framework improves on its baseline segmentation method across several challenging datasets with varying viewpoints, providing improvements of up to 10.9% in F1 score and 5.3% in frame-wise accuracy.
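To make the timestamp-supervision setting concrete, the sketch below shows one simple way sparse per-segment timestamps can be expanded into dense frame-wise pseudo labels: each frame takes the label of its nearest annotated timestamp. This is a hypothetical baseline for illustration only, not the paper's actual algorithm, which additionally refines the resulting intervals using HOI cues; the function name and data layout are assumptions.

```python
# Hypothetical sketch (not the paper's method): expand one annotated
# timestamp per action segment into dense frame-wise pseudo labels by
# nearest-timestamp assignment.

def expand_timestamps(num_frames, timestamps, labels):
    """timestamps: sorted frame indices, one per action segment.
    labels: the action label annotated at each timestamp.
    Returns a list of num_frames pseudo labels."""
    pseudo = []
    for f in range(num_frames):
        # index of the annotated timestamp closest to frame f
        nearest = min(range(len(timestamps)),
                      key=lambda i: abs(timestamps[i] - f))
        pseudo.append(labels[nearest])
    return pseudo

# Two segments annotated at frames 2 and 7 of a 10-frame clip:
print(expand_timestamps(10, [2, 7], ["pour", "stir"]))
# → ['pour', 'pour', 'pour', 'pour', 'pour', 'stir', 'stir', 'stir', 'stir', 'stir']
```

In practice, such naive expansion places the segment boundary midway between timestamps, which is exactly where methods like the one in this paper improve on the baseline by exploiting the continuity of human-object interaction.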