Robust Real-Time Human Action Detection through the Fusion of 3D and 2D CNN
Edwin Kwadwo Tenagyei, Zongbo Hao, Kwadwo Kusi, K. Sarpong
2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), 16 July 2021. DOI: 10.1109/PRML52754.2021.9520696
Recent approaches to human action detection often rely on appearance and optical-flow networks for frame-level detections, which are then linked to form action tubes. However, their real-time performance is unsatisfactory because of high computational complexity and large parameter counts during training. In this paper, we design and implement a unified, end-to-end convolutional neural network (CNN) architecture with two branches that extract spatial and temporal information concurrently before predicting bounding boxes and action probabilities from video clips. We also design a novel mechanism that exploits inter-channel dependencies to fuse the features from the two branches effectively. Specifically, we propose a Channel Fusion and Relation-Global Attention (CFRGA) module that aggregates the two feature streams smoothly and models their inter-channel dependencies by considering global-scope structural relation information when inferring attention. We conduct experiments on the untrimmed video dataset UCF101-24 and achieve strong results in frame-mAP and video-mAP. The experimental results show that the channel fusion and relation-global attention module is central to this performance.
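The abstract does not give the internals of the architecture or the CFRGA module, but the overall shape it describes can be sketched. The PyTorch code below is a minimal, illustrative sketch only: the toy 3D and 2D backbones, the layer sizes, the anchor-based head, and the SE-style channel-attention fusion (a stand-in for the actual CFRGA module) are all assumptions, not the authors' implementation.

```python
# Illustrative sketch of a two-branch 3D/2D CNN with channel-attention
# fusion, in the spirit of the architecture described in the abstract.
# All backbones, layer sizes, and fusion details are assumptions.
import torch
import torch.nn as nn


class ChannelFusionAttention(nn.Module):
    """Reweights channels of fused features using global context.

    SE-style stand-in for the paper's CFRGA module (whose details are
    not given in the abstract): it models inter-channel dependencies
    from a globally pooled descriptor of the concatenated features.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -> global average pool -> per-channel weights.
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # (N, C) attention weights
        return x * w.view(n, c, 1, 1)     # channel-reweighted features


class TwoBranchDetector(nn.Module):
    def __init__(self, num_classes: int = 24, num_anchors: int = 5):
        super().__init__()
        # Temporal branch: toy 3D CNN over a clip (N, 3, T, H, W).
        self.branch3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Spatial branch: toy 2D CNN over the clip's keyframe (N, 3, H, W).
        self.branch2d = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fusion = ChannelFusionAttention(256)
        # Per-location predictions: 4 box coordinates + class scores
        # for each anchor, as in single-stage detectors.
        self.head = nn.Conv2d(256, num_anchors * (4 + num_classes), 1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        keyframe = clip[:, :, -1]              # last frame of the clip
        f3d = self.branch3d(clip).mean(dim=2)  # collapse time: (N, 128, H', W')
        f2d = self.branch2d(keyframe)          # (N, 128, H', W')
        fused = self.fusion(torch.cat([f3d, f2d], dim=1))
        return self.head(fused)


if __name__ == "__main__":
    model = TwoBranchDetector()
    clip = torch.randn(2, 3, 8, 224, 224)      # batch of 8-frame clips
    print(model(clip).shape)                    # (2, 140, 56, 56)
```

Because both branches run in a single forward pass and share one detection head, a design like this avoids the separate optical-flow computation that the abstract identifies as the main obstacle to real-time operation.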