{"title":"第一人称视频中使用动物对的人类动作识别","authors":"Zeynep Gökce, Selen Pehlivan","doi":"10.1109/SIU.2019.8806562","DOIUrl":null,"url":null,"abstract":"Human action recognition problem is important for distinguishing the rich variety of human activities in first-person videos. While there has been an improvement in egocentric action recognition, the space of action categories is large and it looks impractical to label training data for all categories. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. Particularly, we use 3 Dimensional Convolutional Neural Network model, C3D, for verb stream to model video-based features, and we use object detection model, YOLO, for noun stream to model objects interacting with human. We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes, and show that our model is comparable to the state-of-the-art action recognition models.","PeriodicalId":326275,"journal":{"name":"2019 27th Signal Processing and Communications Applications Conference (SIU)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Human Action Recognition in First Person Videos using Verb-Object Pairs\",\"authors\":\"Zeynep Gökce, Selen Pehlivan\",\"doi\":\"10.1109/SIU.2019.8806562\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human action recognition problem is important for distinguishing the rich variety of human activities in first-person videos. While there has been an improvement in egocentric action recognition, the space of action categories is large and it looks impractical to label training data for all categories. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. Particularly, we use 3 Dimensional Convolutional Neural Network model, C3D, for verb stream to model video-based features, and we use object detection model, YOLO, for noun stream to model objects interacting with human. 
We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes, and show that our model is comparable to the state-of-the-art action recognition models.\",\"PeriodicalId\":326275,\"journal\":{\"name\":\"2019 27th Signal Processing and Communications Applications Conference (SIU)\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 27th Signal Processing and Communications Applications Conference (SIU)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SIU.2019.8806562\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 27th Signal Processing and Communications Applications Conference (SIU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIU.2019.8806562","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Human Action Recognition in First Person Videos using Verb-Object Pairs
Human action recognition is important for distinguishing the rich variety of human activities in first-person videos. While egocentric action recognition has improved, the space of action categories is large, and labeling training data for all categories is impractical. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. In particular, we use a 3D Convolutional Neural Network (C3D) for the verb stream to model video-based features, and an object detection model (YOLO) for the noun stream to model the objects interacting with the human. We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes and show that our model is comparable to state-of-the-art action recognition models.
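The abstract says only that the verb and noun streams are combined with "a simple fusion strategy". As a minimal sketch of how such a verb-noun decomposition can score action classes, the Python snippet below assumes a product-rule late fusion over (verb, noun) pairs; the function name `fuse_verb_noun`, the toy labels, and the product rule itself are illustrative assumptions, not necessarily the authors' exact scheme.

```python
import numpy as np

def fuse_verb_noun(verb_probs: np.ndarray,
                   noun_probs: np.ndarray,
                   action_pairs: list[tuple[int, int]]) -> np.ndarray:
    """Score each action class defined as a (verb_id, noun_id) pair.

    verb_probs:   softmax output of the verb stream (e.g. C3D), shape (num_verbs,)
    noun_probs:   per-noun confidences from the noun stream (e.g. YOLO
                  detections), shape (num_nouns,)
    action_pairs: the (verb_id, noun_id) pair defining each action class
    """
    # Product-rule fusion: an action is likely if both its verb and its
    # noun are likely. This is one common "simple" choice, assumed here.
    scores = np.array([verb_probs[v] * noun_probs[n] for v, n in action_pairs])
    return scores / scores.sum()  # renormalize over the valid action classes

# Toy usage: 3 verbs, 2 nouns, 4 valid (verb, noun) action classes.
verb_probs = np.array([0.7, 0.2, 0.1])    # e.g. take / open / cut
noun_probs = np.array([0.6, 0.4])         # e.g. cup / fridge
pairs = [(0, 0), (1, 1), (2, 0), (0, 1)]  # valid verb-noun combinations
print(fuse_verb_noun(verb_probs, noun_probs, pairs).argmax())  # -> 0 ("take cup")
```

Product fusion implicitly treats the two streams as independent given the video; a weighted sum of stream scores or a small learned fusion layer would be equally plausible "simple" strategies under the same decomposition.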