{"title":"重姿态授权RGB网视频动作识别","authors":"Song Ren, Meng Ding","doi":"10.1109/ICCECE58074.2023.10135328","DOIUrl":null,"url":null,"abstract":"Recently, works related to video action recognition focus on using hybrid streams as input to get better results. Those streams usually are combinations of RGB channel with one additional feature stream such as audio, optical flow and pose information. Among those extra streams, posture as unstructured data is more difficult to fuse with RGB channel than the others. In this paper, we propose our Heavy Pose Empowered RGB Nets (HPER-Nets) ‐‐an end-to-end multitasking model‐‐based on the thorough investigation on how to fuse posture and RGB information. Given video frames as the only input, our model will reinforce it by merging the intrinsic posture information in the form of part affinity fields (PAFs), and use this hybrid stream to perform further video action recognition. Experimental results show that our model can outperform other different methods on UCF-101, UMDB and Kinetics datasets, and with only 16 frames, a 95.3% Top-1 accuracy on UCF101, a 69.6% on HMDB and a 41.0% on Kinetics have been recorded.","PeriodicalId":120030,"journal":{"name":"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Heavy Pose Empowered RGB Nets for Video Action Recognition\",\"authors\":\"Song Ren, Meng Ding\",\"doi\":\"10.1109/ICCECE58074.2023.10135328\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, works related to video action recognition focus on using hybrid streams as input to get better results. Those streams usually are combinations of RGB channel with one additional feature stream such as audio, optical flow and pose information. Among those extra streams, posture as unstructured data is more difficult to fuse with RGB channel than the others. In this paper, we propose our Heavy Pose Empowered RGB Nets (HPER-Nets) ‐‐an end-to-end multitasking model‐‐based on the thorough investigation on how to fuse posture and RGB information. Given video frames as the only input, our model will reinforce it by merging the intrinsic posture information in the form of part affinity fields (PAFs), and use this hybrid stream to perform further video action recognition. 
Experimental results show that our model can outperform other different methods on UCF-101, UMDB and Kinetics datasets, and with only 16 frames, a 95.3% Top-1 accuracy on UCF101, a 69.6% on HMDB and a 41.0% on Kinetics have been recorded.\",\"PeriodicalId\":120030,\"journal\":{\"name\":\"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCECE58074.2023.10135328\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCECE58074.2023.10135328","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Heavy Pose Empowered RGB Nets for Video Action Recognition
Song Ren, Meng Ding
2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), 6 January 2023
DOI: 10.1109/ICCECE58074.2023.10135328
Recent work on video action recognition has focused on using hybrid streams as input to obtain better results. These streams are usually combinations of the RGB channel with one additional feature stream, such as audio, optical flow, or pose information. Among these extra streams, posture, as unstructured data, is more difficult to fuse with the RGB channel than the others. In this paper, we propose Heavy Pose Empowered RGB Nets (HPER-Nets), an end-to-end multitasking model, based on a thorough investigation of how to fuse posture and RGB information. Given video frames as the only input, our model reinforces them by merging the intrinsic posture information, in the form of part affinity fields (PAFs), and uses this hybrid stream for video action recognition. Experimental results show that our model outperforms other methods on the UCF-101, HMDB, and Kinetics datasets: with only 16 frames, it achieves 95.3% Top-1 accuracy on UCF-101, 69.6% on HMDB, and 41.0% on Kinetics.
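The abstract says that PAFs are merged with the RGB frames to form a hybrid stream but does not specify the fusion mechanism. The sketch below illustrates one plausible reading, channel-wise concatenation of PAF maps with RGB frames feeding a 3D-CNN classifier; HybridStreamClassifier, the tiny backbone, and NUM_LIMBS = 19 (two PAF channels per limb, as in OpenPose) are illustrative assumptions, not the authors' HPER-Nets architecture.

```python
# Minimal sketch of channel-wise RGB + PAF fusion (assumed, not the
# authors' HPER-Nets). PAF maps are treated as extra input channels
# alongside the 3 RGB channels of each frame.
import torch
import torch.nn as nn

NUM_LIMBS = 19              # assumption: one 2-channel PAF per limb (OpenPose-style)
PAF_CHANNELS = 2 * NUM_LIMBS

class HybridStreamClassifier(nn.Module):
    """Concatenates RGB frames with PAF maps and classifies the clip."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        in_channels = 3 + PAF_CHANNELS  # hybrid stream: RGB + PAFs
        # Placeholder backbone; the paper's actual network is not specified here.
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, rgb: torch.Tensor, pafs: torch.Tensor) -> torch.Tensor:
        # rgb: (batch, 3, T, H, W); pafs: (batch, 2*L, T, H, W)
        hybrid = torch.cat([rgb, pafs], dim=1)   # fuse along the channel axis
        features = self.backbone(hybrid).flatten(1)
        return self.head(features)

# Usage with a 16-frame clip, matching the clip length in the abstract.
model = HybridStreamClassifier(num_classes=101)
rgb = torch.randn(2, 3, 16, 112, 112)
pafs = torch.randn(2, PAF_CHANNELS, 16, 112, 112)
logits = model(rgb, pafs)                        # shape: (2, 101)
```

In an end-to-end multitasking setup such as the one the abstract describes, the PAF maps would be predicted from the RGB frames by a pose branch rather than supplied externally; here they are passed in directly to keep the fusion step itself visible.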