Combining Image Inductive Bias and Self-Attention Mechanism for Accurate Isolated Sign Language Recognition
Authors: Jieshun You, Zekai He, Shun-Ping Lin, Ling Chen
Published in: 2023 8th International Conference on Computer and Communication Systems (ICCCS), 2023-04-21
DOI: https://doi.org/10.1109/ICCCS57501.2023.10150914
Isolated sign language recognition is an important part of breaking down communication barriers between deaf-mute people and others. To address this problem, this paper classifies isolated American Sign Language videos by modeling representations of pose, hand, and face keypoints. Specifically, it introduces a novel framework whose main components are a modified, pre-trained Dense Predictive Coding (DPC) model and a pre-trained Encoder model. The DPC model is trained with self-supervised learning to obtain representations of the pose and hand keypoints. The Encoder model is trained with supervised learning to obtain representations of the face keypoints. By combining the modified DPC model, which carries image inductive biases, with the Encoder model, which uses a self-attention mechanism, the final combined model achieves a score of 0.81 on the test set of the ISAL dataset, outperforming the current open-source solution by a significant margin.
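To make the two-branch idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It applies single-head scaled dot-product self-attention to a sequence of face keypoint frames, averages the result over time, and concatenates it with a stand-in vector for the pose-and-hand branch before a linear classifier. The abstract does not say how the two representations are combined, so concatenation is an assumption here, as are all sizes, the random weights, and the function names `self_attention` and `fuse_and_classify`.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a frame sequence.

    frames: (T, D) array, one flattened keypoint vector per video frame.
    Returns the attended sequence (T, d_model) and the weights (T, T).
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


def fuse_and_classify(pose_hand_repr, face_repr, w_out):
    """Concatenate two clip-level vectors and return class probabilities.

    Concatenation is an assumed fusion rule, chosen only for illustration.
    """
    fused = np.concatenate([pose_hand_repr, face_repr])
    return softmax(fused @ w_out)


# Hypothetical sizes: 16 frames, 20 face keypoints with (x, y) coordinates.
rng = np.random.default_rng(0)
n_frames, n_face_kp, d_model, d_pose, n_classes = 16, 20, 32, 48, 5

face_frames = rng.normal(size=(n_frames, n_face_kp * 2))
w_q = rng.normal(size=(n_face_kp * 2, d_model)) * 0.1
w_k = rng.normal(size=(n_face_kp * 2, d_model)) * 0.1
w_v = rng.normal(size=(n_face_kp * 2, d_model)) * 0.1

attended, weights = self_attention(face_frames, w_q, w_k, w_v)
face_repr = attended.mean(axis=0)  # average over time -> (d_model,)

# Stand-in for the output of the pose-and-hand (DPC) branch.
pose_hand_repr = rng.normal(size=d_pose)

w_out = rng.normal(size=(d_pose + d_model, n_classes)) * 0.1
probs = fuse_and_classify(pose_hand_repr, face_repr, w_out)
print(probs.shape, weights.shape)
```

In a trained system the projection matrices and the classifier would be learned, and the pose-and-hand vector would come from the pre-trained DPC branch instead of random noise.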