{"title":"Research on Multimodal Human Behavior Recognition Based on Double Flow Network","authors":"Xiao Bao","doi":"10.1109/ICISCAE52414.2021.9590666","DOIUrl":null,"url":null,"PeriodicalId":121049,"journal":{"name":"2021 IEEE 4th International Conference on Information Systems and Computer Aided Education (ICISCAE)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 4th International Conference on Information Systems and Computer Aided Education (ICISCAE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICISCAE52414.2021.9590666","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Research on Multimodal Human Behavior Recognition Based on Double Flow Network
Automatically identifying and analyzing human behavior in video with computer vision has become a research hotspot. Traditional behavior recognition methods rely on manually extracted features, so recognition performance depends largely on the experience of the feature designer. This paper builds on the theory of dual-stream convolutional neural networks and uses the TSN (Temporal Segment Networks) model as its basic framework to analyze the shortcomings of the single-stream network and the original dual-stream network. It then proposes a multi-modal human behavior recognition model based on the dual-stream network. To extract video-level features effectively, the model adopts two attention mechanisms: one for learning image-frame features and one for video-level feature transfer. A CNN is then used to extract global motion features, which are finally fused with the spatio-temporal features. The fused feature is evaluated on a public data set. The results show that the two features are complementary, that their fusion makes the representation more expressive, and that recognition results improve substantially over the spatio-temporal feature alone.
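As a rough illustration of the pipeline the abstract outlines, the sketch below walks through three steps in plain Python: TSN-style sparse sampling of one frame per segment, attention-weighted pooling of per-frame features into a video-level feature, and fusion with a global motion feature. The abstract does not specify the attention form or the fusion operator, so dot-product softmax attention and simple concatenation are assumptions made here, and every function name and the `query` vector are hypothetical stand-ins rather than the paper's implementation.

```python
import math


def sample_segment_indices(num_frames, num_segments):
    """TSN-style sparse sampling: split the video into equal segments and
    take the middle frame index of each (deterministic variant)."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    return [int(seg_len * k + seg_len / 2) for k in range(num_segments)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_features, query):
    """Collapse per-frame features into one video-level feature.

    Assumed attention form: each frame is scored by its dot product with
    `query` (a stand-in for learned parameters), and the softmax of those
    scores gives the weights of a weighted sum over frames.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


def fuse(spatio_temporal, global_motion):
    """Assumed fusion operator: concatenate the two feature vectors."""
    return list(spatio_temporal) + list(global_motion)


if __name__ == "__main__":
    # Toy inputs: 3 sampled frames with 2-dimensional features each.
    indices = sample_segment_indices(num_frames=90, num_segments=3)
    frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    video_feature, weights = attention_pool(frame_features, query=[0.5, 0.5])
    fused = fuse(video_feature, global_motion=[0.2, 0.4, 0.6])
    print(indices, weights, fused)
```

Concatenation is chosen only because it keeps both features intact for a downstream classifier; weighted summation or score-level averaging would be equally plausible readings of "fusion" in the abstract.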