{"title":"第一人称视频中使用动物对的人类动作识别","authors":"Zeynep Gökce, Selen Pehlivan","doi":"10.1109/SIU.2019.8806562","DOIUrl":null,"url":null,"abstract":"Human action recognition problem is important for distinguishing the rich variety of human activities in first-person videos. While there has been an improvement in egocentric action recognition, the space of action categories is large and it looks impractical to label training data for all categories. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. Particularly, we use 3 Dimensional Convolutional Neural Network model, C3D, for verb stream to model video-based features, and we use object detection model, YOLO, for noun stream to model objects interacting with human. We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes, and show that our model is comparable to the state-of-the-art action recognition models.","PeriodicalId":326275,"journal":{"name":"2019 27th Signal Processing and Communications Applications Conference (SIU)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Human Action Recognition in First Person Videos using Verb-Object Pairs\",\"authors\":\"Zeynep Gökce, Selen Pehlivan\",\"doi\":\"10.1109/SIU.2019.8806562\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human action recognition problem is important for distinguishing the rich variety of human activities in first-person videos. While there has been an improvement in egocentric action recognition, the space of action categories is large and it looks impractical to label training data for all categories. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. Particularly, we use 3 Dimensional Convolutional Neural Network model, C3D, for verb stream to model video-based features, and we use object detection model, YOLO, for noun stream to model objects interacting with human. 
We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes, and show that our model is comparable to the state-of-the-art action recognition models.\",\"PeriodicalId\":326275,\"journal\":{\"name\":\"2019 27th Signal Processing and Communications Applications Conference (SIU)\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 27th Signal Processing and Communications Applications Conference (SIU)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SIU.2019.8806562\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 27th Signal Processing and Communications Applications Conference (SIU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIU.2019.8806562","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Human Action Recognition in First Person Videos using Verb-Object Pairs
Human action recognition is important for distinguishing the rich variety of human activities in first-person videos. While egocentric action recognition has improved, the space of action categories is large, and labeling training data for all categories is impractical. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. In particular, we use a 3D Convolutional Neural Network (C3D) for the verb stream to model video-based features, and an object detection model (YOLO) for the noun stream to model the objects interacting with the human. We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes and show that our model is comparable to state-of-the-art action recognition models.
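The abstract says only that the verb and noun streams are combined with "a simple fusion strategy". As a minimal sketch of how such a verb-noun decomposition can score action classes, the Python snippet below assumes a product-rule late fusion over (verb, noun) pairs; the function name `fuse_verb_noun`, the toy labels, and the product rule itself are illustrative assumptions, not necessarily the authors' exact scheme.

```python
import numpy as np

def fuse_verb_noun(verb_probs: np.ndarray,
                   noun_probs: np.ndarray,
                   action_pairs: list[tuple[int, int]]) -> np.ndarray:
    """Score each action class defined as a (verb_id, noun_id) pair.

    verb_probs:   softmax output of the verb stream (e.g. C3D), shape (num_verbs,)
    noun_probs:   per-noun confidences from the noun stream (e.g. YOLO
                  detections), shape (num_nouns,)
    action_pairs: the (verb_id, noun_id) pair defining each action class
    """
    # Product-rule fusion: an action is likely if both its verb and its
    # noun are likely. This is one common "simple" choice, assumed here.
    scores = np.array([verb_probs[v] * noun_probs[n] for v, n in action_pairs])
    return scores / scores.sum()  # renormalize over the valid action classes

# Toy usage: 3 verbs, 2 nouns, 4 valid (verb, noun) action classes.
verb_probs = np.array([0.7, 0.2, 0.1])    # e.g. take / open / cut
noun_probs = np.array([0.6, 0.4])         # e.g. cup / fridge
pairs = [(0, 0), (1, 1), (2, 0), (0, 1)]  # valid verb-noun combinations
print(fuse_verb_noun(verb_probs, noun_probs, pairs).argmax())  # -> 0 ("take cup")
```

Product fusion implicitly treats the two streams as independent given the video; a weighted sum of stream scores or a small learned fusion layer would be equally plausible "simple" strategies under the same decomposition.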