{"title":"基于空间卷积和时间卷积的手势视频识别","authors":"Denden Raka Setiawan, E. C. Djamal, Fikri Nugraha","doi":"10.1109/ic2ie53219.2021.9649052","DOIUrl":null,"url":null,"abstract":"Hand gestures can be used for indirect interaction. The hand movement in the video causes the hand position of each frame to move. Position shift can be information in the identification of hand movements. However, video identification is not easy; it requires comprehensive feature detection in each part of the frame and identifies connectivity patterns between each frame. The use of architecture in recognizing each characteristic pattern in each frame can affect the identification results due to the number and arrangement of layers in extracting features in each frame. Previous research has identified video hand movements used Convolutional Neural Networks (CNN) with the Single-Stream Spatial CNN method to identifying hand movement patterns but ignoring the relationship between frames. In other research, the Single-Stream Temporal CNN was used to identify hand movements used two frames to connected the relationship between frames at a specific time. This research proposed the Two-Stream CNN method, namely spatial and temporal. Spatial to get the pattern on the frame as a whole. Temporal to obtain information on the relationship between the frame and Optical Flow used the Gunnar Farneback method by looking for light transfer points in pixels in all parts of the frame between two interconnected frames. Two different architectures and optimizers were used, namely VGG16 and Xception, for the architecture, while the optimizer were SGD and AdaDelta. The proposed method used two architectures, namely VGG16 and Xception. Besides, the optimizer model used SGD and AdaDelta. As a result, the Xception architecture with the SGD optimizer model provided a higher accuracy of 98.68%.","PeriodicalId":178443,"journal":{"name":"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)","volume":"216 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hand Gesture Video Identification using Spatial and Temporal Convolutional\",\"authors\":\"Denden Raka Setiawan, E. C. Djamal, Fikri Nugraha\",\"doi\":\"10.1109/ic2ie53219.2021.9649052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hand gestures can be used for indirect interaction. The hand movement in the video causes the hand position of each frame to move. Position shift can be information in the identification of hand movements. However, video identification is not easy; it requires comprehensive feature detection in each part of the frame and identifies connectivity patterns between each frame. The use of architecture in recognizing each characteristic pattern in each frame can affect the identification results due to the number and arrangement of layers in extracting features in each frame. Previous research has identified video hand movements used Convolutional Neural Networks (CNN) with the Single-Stream Spatial CNN method to identifying hand movement patterns but ignoring the relationship between frames. In other research, the Single-Stream Temporal CNN was used to identify hand movements used two frames to connected the relationship between frames at a specific time. This research proposed the Two-Stream CNN method, namely spatial and temporal. Spatial to get the pattern on the frame as a whole. 
Temporal to obtain information on the relationship between the frame and Optical Flow used the Gunnar Farneback method by looking for light transfer points in pixels in all parts of the frame between two interconnected frames. Two different architectures and optimizers were used, namely VGG16 and Xception, for the architecture, while the optimizer were SGD and AdaDelta. The proposed method used two architectures, namely VGG16 and Xception. Besides, the optimizer model used SGD and AdaDelta. As a result, the Xception architecture with the SGD optimizer model provided a higher accuracy of 98.68%.\",\"PeriodicalId\":178443,\"journal\":{\"name\":\"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)\",\"volume\":\"216 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-09-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ic2ie53219.2021.9649052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 4th International Conference of Computer and Informatics Engineering (IC2IE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ic2ie53219.2021.9649052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hand Gesture Video Identification using Spatial and Temporal Convolutional
Hand gestures can be used for indirect interaction. Hand movement in a video causes the hand position to shift from frame to frame, and this position shift carries information for identifying hand movements. However, video identification is not easy: it requires comprehensive feature detection in every part of a frame as well as identification of the connectivity patterns between frames. The choice of architecture for recognizing the characteristic patterns in each frame affects the identification results, since the number and arrangement of layers determine how features are extracted from each frame. Previous research identified hand movements in video using Convolutional Neural Networks (CNN) with a Single-Stream Spatial CNN, which recognizes movement patterns within individual frames but ignores the relationship between frames. Other research used a Single-Stream Temporal CNN, which takes two frames as input to capture the relationship between frames at a specific time. This research proposes a Two-Stream CNN with spatial and temporal streams: the spatial stream captures the pattern of the frame as a whole, while the temporal stream captures the relationship between frames through Optical Flow computed with the Gunnar Farneback method, which tracks the displacement of pixel intensities across all parts of two consecutive frames. Two architectures, VGG16 and Xception, and two optimizers, SGD and AdaDelta, were compared. As a result, the Xception architecture with the SGD optimizer achieved the highest accuracy of 98.68%.
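As an illustration of the temporal stream's input, the dense optical flow between two consecutive frames can be computed with the Gunnar Farneback algorithm. The snippet below is a minimal sketch using OpenCV's implementation; the video path, frame pair, and Farneback parameter values are illustrative assumptions, not values taken from the paper.

```python
import cv2

def farneback_flow(prev_frame, next_frame):
    """Dense Gunnar Farneback optical flow between two consecutive frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Returns an (H, W, 2) array holding the per-pixel (dx, dy) displacement.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow

# Example: two consecutive frames from a gesture video (path is hypothetical).
cap = cv2.VideoCapture("gesture.mp4")
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
if ok1 and ok2:
    flow = farneback_flow(frame1, frame2)
    # Magnitude/angle describe how far and in which direction each pixel moved.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```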
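A two-stream classifier along these lines can be sketched with Keras: one stream takes an RGB frame, the other takes the optical-flow field, and their features are fused before the gesture classification layer. The sketch below uses Xception with the SGD optimizer, matching the best-performing configuration reported in the abstract; the input sizes, number of classes, fusion strategy, and the small CNN used for the temporal stream are assumptions for illustration, since the paper's exact layer arrangement is not given here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 10              # assumed number of gesture classes
FRAME_SHAPE = (224, 224, 3)   # RGB frame for the spatial stream (assumed size)
FLOW_SHAPE = (224, 224, 2)    # (dx, dy) flow field for the temporal stream

# Spatial stream: Xception backbone over a single RGB frame.
frame_in = layers.Input(shape=FRAME_SHAPE)
spatial_backbone = tf.keras.applications.Xception(
    include_top=False, weights=None, pooling="avg")
spatial_feat = spatial_backbone(frame_in)

# Temporal stream: a small CNN over the Farneback flow field (a stand-in for
# the paper's temporal network, whose layers are not specified in the abstract).
flow_in = layers.Input(shape=FLOW_SHAPE)
x = layers.Conv2D(32, 3, strides=2, activation="relu")(flow_in)
x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Late fusion of the two streams, then gesture classification.
fused = layers.Concatenate()([spatial_feat, x])
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = models.Model(inputs=[frame_in, flow_in], outputs=out)
model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

In this sketch the streams are fused late by concatenating pooled features; swapping the spatial backbone for VGG16 or the optimizer for AdaDelta only changes the corresponding constructor call.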