{"title":"使用双向LSTM的未修剪视频中的上下文感知动作检测","authors":"Jaideep Singh Chauhan, Yang Wang","doi":"10.1109/CRV.2018.00039","DOIUrl":null,"url":null,"abstract":"We consider the problem of action detection in untrimmed videos. We argue that the contextual information in a video is important for this task. Based on this intuition, we design a network using a bidirectional Long Short Term Memory (Bi-LSTM) model that captures the contextual information in videos. Our model includes a modified loss function which enforces the network to learn action progression, and a backpropagation in which gradients are weighted on the basis of their origin on the temporal scale. LSTMs are good at capturing the long temporal dependencies, but not so good at modeling local temporal features. In our model, we use a 3-D Convolutional Neural Network (3-D ConvNet) for capturing the local spatio-temporal features of the videos. We perform a comprehensive analysis on the importance of learning the context of the video. Finally, we evaluate our work on two action detection datasets, namely ActivityNet and THUMOS'14. Our method achieves competitive results compared with the existing approaches on both datasets.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Context-Aware Action Detection in Untrimmed Videos Using Bidirectional LSTM\",\"authors\":\"Jaideep Singh Chauhan, Yang Wang\",\"doi\":\"10.1109/CRV.2018.00039\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider the problem of action detection in untrimmed videos. We argue that the contextual information in a video is important for this task. Based on this intuition, we design a network using a bidirectional Long Short Term Memory (Bi-LSTM) model that captures the contextual information in videos. Our model includes a modified loss function which enforces the network to learn action progression, and a backpropagation in which gradients are weighted on the basis of their origin on the temporal scale. LSTMs are good at capturing the long temporal dependencies, but not so good at modeling local temporal features. In our model, we use a 3-D Convolutional Neural Network (3-D ConvNet) for capturing the local spatio-temporal features of the videos. We perform a comprehensive analysis on the importance of learning the context of the video. Finally, we evaluate our work on two action detection datasets, namely ActivityNet and THUMOS'14. 
Our method achieves competitive results compared with the existing approaches on both datasets.\",\"PeriodicalId\":281779,\"journal\":{\"name\":\"2018 15th Conference on Computer and Robot Vision (CRV)\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 15th Conference on Computer and Robot Vision (CRV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CRV.2018.00039\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 15th Conference on Computer and Robot Vision (CRV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CRV.2018.00039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Context-Aware Action Detection in Untrimmed Videos Using Bidirectional LSTM
We consider the problem of action detection in untrimmed videos. We argue that the contextual information in a video is important for this task. Based on this intuition, we design a network using a bidirectional Long Short-Term Memory (Bi-LSTM) model that captures the contextual information in videos. Our model includes a modified loss function that forces the network to learn action progression, and a backpropagation scheme in which gradients are weighted according to their origin on the temporal scale. LSTMs are good at capturing long-range temporal dependencies, but less effective at modeling local temporal features. In our model, we therefore use a 3-D Convolutional Neural Network (3-D ConvNet) to capture the local spatio-temporal features of the videos. We perform a comprehensive analysis of the importance of learning the context of a video. Finally, we evaluate our work on two action detection datasets, ActivityNet and THUMOS'14. Our method achieves competitive results compared with existing approaches on both datasets.
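The abstract outlines a pipeline but gives no implementation details, so the following is only a minimal sketch of one plausible reading: a small 3-D ConvNet extracts per-clip spatio-temporal features, a Bi-LSTM aggregates context over the clip sequence, and each timestep's loss (and hence its gradient) is scaled by a temporal weight. All names, layer sizes, and the linear weighting ramp are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the described architecture (PyTorch). Assumed:
# class/function names, feature dimensions, and the linear weight ramp.
import torch
import torch.nn as nn


class ContextAwareDetector(nn.Module):
    def __init__(self, num_classes, feat_dim=512, hidden_dim=256):
        super().__init__()
        # Tiny 3-D ConvNet stand-in: one conv + global pooling per clip.
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # one feature vector per clip
        )
        self.proj = nn.Linear(64, feat_dim)
        # Bi-LSTM sees both past and future clips, i.e. the video context.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Per-timestep classifier; +1 class for background.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes + 1)

    def forward(self, clips):
        # clips: (B, T, 3, frames, H, W) -- T short clips per video
        b, t = clips.shape[:2]
        x = self.conv3d(clips.flatten(0, 1)).flatten(1)  # (B*T, 64)
        x = self.proj(x).view(b, t, -1)                  # (B, T, feat_dim)
        h, _ = self.bilstm(x)                            # (B, T, 2*hidden_dim)
        return self.classifier(h)                        # (B, T, classes+1)


def temporally_weighted_loss(logits, labels):
    # One plausible reading of "gradients weighted on the basis of their
    # origin on the temporal scale": scale each timestep's cross-entropy
    # by a weight that grows with progress through the sequence, so later
    # timesteps contribute larger gradients. The 0.5 -> 1.0 ramp is assumed.
    b, t, c = logits.shape
    per_step = nn.functional.cross_entropy(
        logits.reshape(b * t, c), labels.reshape(b * t), reduction="none"
    ).view(b, t)
    weights = torch.linspace(0.5, 1.0, t, device=logits.device)
    return (per_step * weights).mean()
```

Under these assumptions, weighting the per-timestep losses is equivalent to weighting the gradients they produce, which is one simple way to bias the network toward learning how an action progresses over time.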