Adaptive feature learning CNN for behavior recognition in crowd scene
Aliyu Nuhu Shuaibu, A. Malik, I. Faye
2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), published 2017-09-01. DOI: 10.1109/ICSIPA.2017.8120636 (https://doi.org/10.1109/ICSIPA.2017.8120636)
Learning and recognizing three-dimensional (3D) adaptive features is important for crowd scene understanding in video surveillance research. Deep learning architectures such as Convolutional Neural Networks (CNNs) have recently achieved considerable success in a range of computer vision applications. Existing approaches, such as hand-crafted methods and 2D-CNN architectures, are widely used for adaptive feature representation of image data. However, learning dynamic, temporal features at 3D scale in videos remains an open problem. In this study, we propose a novel technique, the 3D-scale Convolutional Neural Network (3DS-CNN), which decomposes 3D feature maps into 2D spatial and 2D temporal feature representations. Extensive experiments were conducted on hundreds of video scenes from publicly available crowd datasets. Quantitative and qualitative evaluations indicate that the proposed model outperforms the baseline approaches. A mean average precision (mAP) of 95.30% was recorded on the WWW crowd dataset.
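The abstract does not say exactly how a 3D feature map is split into "2D spatial" and "2D temporal" representations. The minimal NumPy sketch below assumes one common reading: the spatial view is the stack of x-y frames of a (T, H, W) clip, and the temporal view is the stack of x-t slices (one per image row). Each view is then processed by an ordinary 2D convolution. The function names `spatial_view` and `temporal_view`, the kernels, and the toy clip are hypothetical illustrations, not the authors' 3DS-CNN.

```python
import numpy as np


def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """2D 'valid' cross-correlation, the operation a CNN conv layer computes."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=np.result_type(x, k))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out


def spatial_view(volume: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Hypothetical spatial branch: convolve every x-y frame of a (T, H, W) clip."""
    return np.stack([conv2d_valid(frame, k) for frame in volume])


def temporal_view(volume: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Hypothetical temporal branch: convolve every x-t slice volume[:, y, :]."""
    return np.stack([conv2d_valid(volume[:, y, :], k) for y in range(volume.shape[1])])


rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16))           # toy clip: T=8 frames of 16x16 pixels
k_space = np.ones((3, 3)) / 9.0          # 3x3 smoothing kernel over (y, x)
k_time = np.array([[-1.0], [1.0]])       # frame difference along t in an x-t slice

s = spatial_view(clip, k_space)          # shape (8, 14, 14)
t = temporal_view(clip, k_time)          # shape (16, 7, 16)
print(s.shape, t.shape)
```

Splitting the volume this way lets each branch reuse standard 2D convolutions. In a trained network the kernels would be learned rather than fixed, and many such layers would be stacked.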
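The reported 95.30% is a mean average precision, but the abstract does not state the evaluation protocol. As a hedged illustration, the sketch below computes the common non-interpolated average precision for each class (or attribute) and averages it over classes. The official WWW protocol may differ, for example in tie handling or interpolation. The scores and labels are invented toy values, not results from the paper.

```python
import numpy as np


def average_precision(scores, labels) -> float:
    """Non-interpolated AP: mean of precision@k over the ranks k of positive items."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # rank items by descending score
    ranked = labels[order]
    if ranked.sum() == 0:
        return float("nan")                      # AP is undefined without positives
    hits = np.cumsum(ranked)                     # positives seen up to each rank
    ranks = np.arange(1, len(ranked) + 1)
    is_pos = ranked == 1
    return float(np.mean(hits[is_pos] / ranks[is_pos]))


def mean_average_precision(score_matrix, label_matrix) -> float:
    """mAP: average the per-class AP over the columns (classes/attributes)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix, dtype=int)
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.nanmean(aps))                # classes with no positives are skipped


# Toy example: 4 clips scored for 2 hypothetical attributes.
scores = [[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.6, 0.8]]
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(mean_average_precision(scores, labels))   # ≈ 0.9167, i.e. (5/6 + 1) / 2
```

In the toy data the first attribute's positives land at ranks 1 and 3 (AP = (1 + 2/3) / 2 = 5/6), and the second attribute is ranked perfectly (AP = 1).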