{"title":"基于知识迁移学习的多视角融合手语识别","authors":"Liqing Gao, Lei Zhu, Senhua Xue, Liang Wan, Ping Li, Wei Feng","doi":"10.1145/3574131.3574434","DOIUrl":null,"url":null,"abstract":"Word-level sign language recognition (WSLR), which aims to translate a sign video into one word, serves as a fundamental task in visual sign language research. Existing WSLR methods focus on recognizing frontal view hand images, which may hurt performance due to hand occlusion. However, non-frontal view hand images contain complementary and beneficial information that can be used to enhance the frontal view hand images. Based on this observation, the paper presents an end-to-end Multi-View Knowledge Transfer (MVKT) network, which, to our knowledge, is the first SLR work to learn visual features from multiple views simultaneously. The model consists of three components: 1) the 3D-ResNet backbone, to extract view-common and view-specific representations; 2) the Knowledge Transfer module, to interchange complementary information between views; and 3) the View Fusion module, to aggregate discriminative representations for obtaining global clues. In addition, we construct a Multi-View Sign Language (MVSL) dataset, which contains 10,500 sign language videos synchronously collected from multiple views with clear annotations and high quality. Extensive experiments on the MVSL dataset shows that the MVKT model trained with multiple views can achieve significant improvement when tested with either multiple or single views, which makes it feasible and effective in real-world applications.","PeriodicalId":111802,"journal":{"name":"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-View Fusion for Sign Language Recognition through Knowledge Transfer Learning\",\"authors\":\"Liqing Gao, Lei Zhu, Senhua Xue, Liang Wan, Ping Li, Wei Feng\",\"doi\":\"10.1145/3574131.3574434\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Word-level sign language recognition (WSLR), which aims to translate a sign video into one word, serves as a fundamental task in visual sign language research. Existing WSLR methods focus on recognizing frontal view hand images, which may hurt performance due to hand occlusion. However, non-frontal view hand images contain complementary and beneficial information that can be used to enhance the frontal view hand images. Based on this observation, the paper presents an end-to-end Multi-View Knowledge Transfer (MVKT) network, which, to our knowledge, is the first SLR work to learn visual features from multiple views simultaneously. The model consists of three components: 1) the 3D-ResNet backbone, to extract view-common and view-specific representations; 2) the Knowledge Transfer module, to interchange complementary information between views; and 3) the View Fusion module, to aggregate discriminative representations for obtaining global clues. In addition, we construct a Multi-View Sign Language (MVSL) dataset, which contains 10,500 sign language videos synchronously collected from multiple views with clear annotations and high quality. 
Extensive experiments on the MVSL dataset shows that the MVKT model trained with multiple views can achieve significant improvement when tested with either multiple or single views, which makes it feasible and effective in real-world applications.\",\"PeriodicalId\":111802,\"journal\":{\"name\":\"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3574131.3574434\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3574131.3574434","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multi-View Fusion for Sign Language Recognition through Knowledge Transfer Learning
Word-level sign language recognition (WSLR), which aims to translate a sign video into a single word, is a fundamental task in visual sign language research. Existing WSLR methods focus on recognizing frontal-view hand images, where hand occlusion can hurt performance. Non-frontal-view hand images, however, contain complementary information that can be used to enhance the frontal view. Based on this observation, this paper presents an end-to-end Multi-View Knowledge Transfer (MVKT) network, which, to our knowledge, is the first SLR work to learn visual features from multiple views simultaneously. The model consists of three components: 1) a 3D-ResNet backbone, which extracts view-common and view-specific representations; 2) a Knowledge Transfer module, which exchanges complementary information between views; and 3) a View Fusion module, which aggregates discriminative representations to obtain global clues. In addition, we construct a Multi-View Sign Language (MVSL) dataset containing 10,500 sign language videos synchronously collected from multiple views, with clear annotations and high quality. Extensive experiments on the MVSL dataset show that the MVKT model trained with multiple views achieves significant improvements when tested with either multiple views or a single view, which makes it feasible and effective in real-world applications.
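To make the three-component design described above concrete, the following is a minimal sketch of a multi-view word-level classifier with a shared 3D backbone, a cross-view knowledge-transfer step, and a view-fusion head. It assumes a PyTorch setting; the module names, feature sizes, the tiny stand-in backbone, and the attention-based form of the transfer step are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class TinyBackbone3D(nn.Module):
    """Stand-in for the 3D-ResNet backbone: maps one video clip to a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, x):                       # x: (B, 3, T, H, W)
        return self.conv(x).flatten(1)          # -> (B, feat_dim)

class MVKTSketch(nn.Module):
    """Illustrative backbone -> knowledge transfer -> view fusion pipeline."""
    def __init__(self, num_classes=500, feat_dim=256):
        super().__init__()
        self.backbone = TinyBackbone3D(feat_dim)    # shared across views
        # Knowledge transfer: each view attends to the others (assumed form).
        self.transfer = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # View fusion: aggregate refined per-view features into a global clue.
        self.fusion = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, views):                   # views: list of (B, 3, T, H, W), one per view
        feats = torch.stack([self.backbone(v) for v in views], dim=1)  # (B, V, D)
        refined, _ = self.transfer(feats, feats, feats)                # cross-view exchange
        fused = self.fusion(refined.mean(dim=1))                       # (B, D) global clue
        return self.classifier(fused)                                  # word-level logits

if __name__ == "__main__":
    model = MVKTSketch(num_classes=10)
    clips = [torch.randn(2, 3, 8, 64, 64) for _ in range(2)]  # two synchronized views
    print(model(clips).shape)                                  # torch.Size([2, 10])

Because the per-view features are exchanged before fusion, the same model can also be run at test time on a single view by passing a one-element list, which mirrors the single-view evaluation setting reported in the abstract.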