{"title":"基于空间卷积和时间卷积的手势视频识别","authors":"Denden Raka Setiawan, E. C. Djamal, Fikri Nugraha","doi":"10.1109/ic2ie53219.2021.9649052","DOIUrl":null,"url":null,"abstract":"Hand gestures can be used for indirect interaction. The hand movement in the video causes the hand position of each frame to move. Position shift can be information in the identification of hand movements. However, video identification is not easy; it requires comprehensive feature detection in each part of the frame and identifies connectivity patterns between each frame. The use of architecture in recognizing each characteristic pattern in each frame can affect the identification results due to the number and arrangement of layers in extracting features in each frame. Previous research has identified video hand movements used Convolutional Neural Networks (CNN) with the Single-Stream Spatial CNN method to identifying hand movement patterns but ignoring the relationship between frames. In other research, the Single-Stream Temporal CNN was used to identify hand movements used two frames to connected the relationship between frames at a specific time. This research proposed the Two-Stream CNN method, namely spatial and temporal. Spatial to get the pattern on the frame as a whole. Temporal to obtain information on the relationship between the frame and Optical Flow used the Gunnar Farneback method by looking for light transfer points in pixels in all parts of the frame between two interconnected frames. Two different architectures and optimizers were used, namely VGG16 and Xception, for the architecture, while the optimizer were SGD and AdaDelta. The proposed method used two architectures, namely VGG16 and Xception. Besides, the optimizer model used SGD and AdaDelta. As a result, the Xception architecture with the SGD optimizer model provided a higher accuracy of 98.68%.","PeriodicalId":178443,"journal":{"name":"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)","volume":"216 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hand Gesture Video Identification using Spatial and Temporal Convolutional\",\"authors\":\"Denden Raka Setiawan, E. C. Djamal, Fikri Nugraha\",\"doi\":\"10.1109/ic2ie53219.2021.9649052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hand gestures can be used for indirect interaction. The hand movement in the video causes the hand position of each frame to move. Position shift can be information in the identification of hand movements. However, video identification is not easy; it requires comprehensive feature detection in each part of the frame and identifies connectivity patterns between each frame. The use of architecture in recognizing each characteristic pattern in each frame can affect the identification results due to the number and arrangement of layers in extracting features in each frame. Previous research has identified video hand movements used Convolutional Neural Networks (CNN) with the Single-Stream Spatial CNN method to identifying hand movement patterns but ignoring the relationship between frames. In other research, the Single-Stream Temporal CNN was used to identify hand movements used two frames to connected the relationship between frames at a specific time. This research proposed the Two-Stream CNN method, namely spatial and temporal. Spatial to get the pattern on the frame as a whole. 
Temporal to obtain information on the relationship between the frame and Optical Flow used the Gunnar Farneback method by looking for light transfer points in pixels in all parts of the frame between two interconnected frames. Two different architectures and optimizers were used, namely VGG16 and Xception, for the architecture, while the optimizer were SGD and AdaDelta. The proposed method used two architectures, namely VGG16 and Xception. Besides, the optimizer model used SGD and AdaDelta. As a result, the Xception architecture with the SGD optimizer model provided a higher accuracy of 98.68%.\",\"PeriodicalId\":178443,\"journal\":{\"name\":\"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)\",\"volume\":\"216 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-09-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ic2ie53219.2021.9649052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ic2ie53219.2021.9649052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hand Gesture Video Identification using Spatial and Temporal Convolutional
Hand gestures can be used for indirect interaction. Hand movement in a video causes the hand position to shift from frame to frame, and this position shift carries information for identifying hand movements. However, video identification is not easy: it requires comprehensive feature detection in every part of a frame as well as identification of the connectivity patterns between frames. The choice of architecture for recognizing the characteristic patterns in each frame affects the identification results, since the number and arrangement of layers determine how features are extracted from each frame. Previous research identified hand movements in video using Convolutional Neural Networks (CNN) with a Single-Stream Spatial CNN, which recognizes movement patterns within individual frames but ignores the relationship between frames. Other research used a Single-Stream Temporal CNN, which takes two frames as input to capture the relationship between frames at a specific time. This research proposes a Two-Stream CNN with spatial and temporal streams: the spatial stream captures the pattern of the frame as a whole, while the temporal stream captures the relationship between frames through Optical Flow computed with the Gunnar Farneback method, which tracks the displacement of pixel intensities across all parts of two consecutive frames. Two architectures, VGG16 and Xception, and two optimizers, SGD and AdaDelta, were compared. As a result, the Xception architecture with the SGD optimizer achieved the highest accuracy of 98.68%.
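As an illustration of the temporal stream's input, the dense optical flow between two consecutive frames can be computed with the Gunnar Farneback algorithm. The snippet below is a minimal sketch using OpenCV's implementation; the video path, frame pair, and Farneback parameter values are illustrative assumptions, not values taken from the paper.

```python
import cv2

def farneback_flow(prev_frame, next_frame):
    """Dense Gunnar Farneback optical flow between two consecutive frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Returns an (H, W, 2) array holding the per-pixel (dx, dy) displacement.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow

# Example: two consecutive frames from a gesture video (path is hypothetical).
cap = cv2.VideoCapture("gesture.mp4")
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
if ok1 and ok2:
    flow = farneback_flow(frame1, frame2)
    # Magnitude/angle describe how far and in which direction each pixel moved.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```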
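A two-stream classifier along these lines can be sketched with Keras: one stream takes an RGB frame, the other takes the optical-flow field, and their features are fused before the gesture classification layer. The sketch below uses Xception with the SGD optimizer, matching the best-performing configuration reported in the abstract; the input sizes, number of classes, fusion strategy, and the small CNN used for the temporal stream are assumptions for illustration, since the paper's exact layer arrangement is not given here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 10              # assumed number of gesture classes
FRAME_SHAPE = (224, 224, 3)   # RGB frame for the spatial stream (assumed size)
FLOW_SHAPE = (224, 224, 2)    # (dx, dy) flow field for the temporal stream

# Spatial stream: Xception backbone over a single RGB frame.
frame_in = layers.Input(shape=FRAME_SHAPE)
spatial_backbone = tf.keras.applications.Xception(
    include_top=False, weights=None, pooling="avg")
spatial_feat = spatial_backbone(frame_in)

# Temporal stream: a small CNN over the Farneback flow field (a stand-in for
# the paper's temporal network, whose layers are not specified in the abstract).
flow_in = layers.Input(shape=FLOW_SHAPE)
x = layers.Conv2D(32, 3, strides=2, activation="relu")(flow_in)
x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Late fusion of the two streams, then gesture classification.
fused = layers.Concatenate()([spatial_feat, x])
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = models.Model(inputs=[frame_in, flow_in], outputs=out)
model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

In this sketch the streams are fused late by concatenating pooled features; swapping the spatial backbone for VGG16 or the optimizer for AdaDelta only changes the corresponding constructor call.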