A Vision-based Human Action Recognition System for Moving Cameras Through Deep Learning

International Conference on Signal Processing and Machine Learning. Pub Date: 2019-11-27. DOI: 10.1145/3372806.3372815

Ming-Jen Chang, Jih-Tang Hsieh, C. Fang, Sei-Wang Chen


Citations: 11

Abstract



A Vision-based Human Action Recognition System for Moving Cameras Through Deep Learning

This study presents a vision-based human action recognition system built on deep learning. The system recognizes human actions while a robot's camera moves toward the target person from various directions, which makes the method suitable for the vision systems of indoor mobile robots.

The system draws on three types of information: color videos, optical flow videos, and depth videos. First, a Kinect 2.0 captures color and depth videos simultaneously using its RGB camera and depth sensor. Second, histogram of oriented gradients (HOG) features are extracted from the color videos, and a support vector machine (SVM) detects the human region. The color frames are cropped to the detected human region, and the corresponding optical flow frames are computed with the Farnebäck method (https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html). A frame sampling technique then unifies the number of frames across these videos. Next, the three types of videos are fed separately into three modified 3D convolutional neural networks (3D CNNs), which extract the spatiotemporal features of human actions and recognize them. Finally, the three recognition results are integrated to produce the final action recognition result.

The proposed system recognizes 13 types of human actions: drink (sit), drink (stand), eat (sit), eat (stand), read, sit down, stand up, use a computer, walk (horizontal), walk (straight), play with a phone/tablet, walk away from each other, and walk toward each other. The average recognition rate over 369 test videos was 96.4%, indicating that the proposed system is robust and efficient.
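The abstract names a frame sampling step and a final integration step but does not say how either is done. The sketch below is one plausible reading, not the authors' implementation: it assumes uniform temporal sampling to a fixed clip length and equal-weight averaging of per-class scores from the three streams. The function names `sample_frame_indices` and `fuse_scores` and the example scores are hypothetical; only the 13 action labels come from the abstract.

```python
from typing import List, Sequence

# The 13 action labels listed in the abstract, in the order given there.
ACTIONS = [
    "drink (sit)", "drink (stand)", "eat (sit)", "eat (stand)", "read",
    "sit down", "stand up", "use a computer", "walk (horizontal)",
    "walk (straight)", "play with a phone/tablet",
    "walk away from each other", "walk toward each other",
]


def sample_frame_indices(num_frames: int, target: int) -> List[int]:
    """Pick `target` frame indices spread evenly over a clip of `num_frames`.

    Assumption: uniform sampling. Short clips repeat frames, long clips
    skip frames, so every clip ends up with the same length.
    """
    if num_frames <= 0 or target <= 0:
        raise ValueError("num_frames and target must be positive")
    # Use the centre of each of `target` equal-width bins over the clip.
    return [min(num_frames - 1, int((i + 0.5) * num_frames / target))
            for i in range(target)]


def fuse_scores(streams: Sequence[Sequence[float]]) -> int:
    """Average per-class scores across streams; return the winning class index.

    Assumption: equal-weight late fusion. The source does not state the rule.
    """
    if not streams:
        raise ValueError("at least one stream is required")
    num_classes = len(streams[0])
    if any(len(s) != num_classes for s in streams):
        raise ValueError("all streams must score the same number of classes")
    mean = [sum(s[c] for s in streams) / len(streams)
            for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


# Hypothetical scores from the color, optical flow, and depth networks.
color = [0.0] * len(ACTIONS)
flow = [0.0] * len(ACTIONS)
depth = [0.0] * len(ACTIONS)
color[4], color[7] = 0.6, 0.4   # color stream leans toward "read"
flow[4], flow[7] = 0.3, 0.7     # flow stream leans toward "use a computer"
depth[4], depth[7] = 0.5, 0.5   # depth stream is undecided
predicted = ACTIONS[fuse_scores([color, flow, depth])]
```

In this example the color stream alone would pick "read", but the averaged scores favor "use a computer", which illustrates why combining the streams can differ from any single network's output. A weighted average or a learned fusion layer would be equally valid readings of the abstract.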

