Human Action Recognition Based on Improved Fusion Attention CNN and RNN
Han Zhao, Xinyu Jin
2020 5th International Conference on Computational Intelligence and Applications (ICCIA), June 2020
DOI: 10.1109/ICCIA49625.2020.00028
Attention-based models are widely used in computer vision and natural language processing, and action recognition in videos is no exception. In this paper, we develop a novel convolutional and recurrent network for action recognition that is "doubly deep", with both spatial and temporal layers. First, in the feature extraction stage, we propose an improved p-non-local operation as a simple and effective component for capturing long-distance dependencies with deep convolutional neural networks. Second, in the class prediction stage, we propose Fusion KeyLess Attention, which combines keyless attention with a forward-backward bidirectional LSTM to learn the sequential nature of the data more efficiently and elegantly, and which fuses models from multiple training epochs based on their confusion matrices. Experiments on two heterogeneous datasets, HMDB51 and Hollywood2, show that our model has distinct advantages over traditional CNN- and RNN-based models that likewise use only RGB features for action recognition.
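The non-local operation referenced in the feature extraction stage computes, for each position, a weighted sum over all other positions, so that long-distance dependencies are captured in a single step. A minimal NumPy sketch of the standard embedded-Gaussian non-local block is given below (the projection matrices and shapes are illustrative assumptions, not the paper's exact p-non-local variant):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    # x: (N, C) — N flattened spatio-temporal positions with C channels each.
    theta = x @ w_theta              # query embedding, (N, C')
    phi = x @ w_phi                  # key embedding, (N, C')
    g = x @ w_g                      # value embedding, (N, C')
    attn = softmax(theta @ phi.T)    # pairwise affinities f(x_i, x_j), (N, N)
    y = attn @ g                     # aggregate context from ALL positions, (N, C')
    return x + y @ w_out             # project back and add residual, (N, C)

# Toy usage: 6 positions, 4 channels, same embedding width for simplicity.
rng = np.random.default_rng(1)
x = rng.standard_normal((6, 4))
ws = [rng.standard_normal((4, 4)) for _ in range(4)]
out = non_local_block(x, *ws)
assert out.shape == x.shape
```

Because the attention matrix is `(N, N)`, every output position can attend to every input position, which is exactly the long-range behavior a stack of small convolutions struggles to achieve.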
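Keyless attention, used in the class prediction stage, scores each BiLSTM hidden state directly — without a separate query/key pair — and pools the sequence into a single clip-level feature. A minimal NumPy sketch under that reading (the scoring vector `w` and all dimensions are illustrative assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def keyless_attention(h, w):
    # h: (T, D) — BiLSTM hidden states over T time steps.
    # w: (D,)  — learned scoring vector; scores depend on h alone (no query).
    scores = softmax(h @ w)          # (T,) attention weights over time
    return scores @ h                # (D,) attention-weighted temporal summary

# Toy usage: 8 frames of 16-dim (e.g. concatenated forward/backward) features.
rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16))
w = rng.standard_normal(16)
clip_feature = keyless_attention(h, w)
print(clip_feature.shape)            # (16,)
```

The pooled feature then feeds a classifier; fusing classifiers saved at multiple training epochs, weighted by their confusion matrices, is the "fusion" step the abstract names.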