Multi-View Fusion for Sign Language Recognition through Knowledge Transfer Learning

Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry. Pub Date: 2022-12-27. DOI: 10.1145/3574131.3574434

Liqing Gao, Lei Zhu, Senhua Xue, Liang Wan, Ping Li, Wei Feng



Abstract

Word-level sign language recognition (WSLR), which aims to translate a sign video into one word, serves as a fundamental task in visual sign language research. Existing WSLR methods focus on recognizing frontal-view hand images, and their performance can suffer when the hands are occluded. Non-frontal-view hand images, however, contain complementary information that can be used to enhance the frontal-view images. Based on this observation, the paper presents an end-to-end Multi-View Knowledge Transfer (MVKT) network, which, to our knowledge, is the first sign language recognition (SLR) work to learn visual features from multiple views simultaneously. The model consists of three components: 1) the 3D-ResNet backbone, to extract view-common and view-specific representations; 2) the Knowledge Transfer module, to exchange complementary information between views; and 3) the View Fusion module, to aggregate discriminative representations for obtaining global clues. In addition, we construct a Multi-View Sign Language (MVSL) dataset, which contains 10,500 high-quality, clearly annotated sign language videos collected synchronously from multiple views. Extensive experiments on the MVSL dataset show that the MVKT model trained with multiple views achieves significant improvement when tested with either multiple or single views, which makes it feasible and effective in real-world applications.
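The three-stage flow described in the abstract (per-view features, exchange between views, fusion into one global representation) can be illustrated with a minimal sketch. This is not the MVKT implementation: the abstract does not specify how the Knowledge Transfer or View Fusion modules work internally, so the similarity-weighted mixing in `transfer`, the element-wise mean in `fuse`, and the linear `classify` head are all assumptions chosen only for illustration. The per-view feature vectors stand in for the output of a 3D-ResNet backbone, which is not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def transfer(features):
    """Hypothetical stand-in for knowledge transfer (an assumption):
    each view adds a similarity-weighted mix of the other views'
    features to its own. A single view is returned unchanged."""
    if len(features) == 1:
        return [list(features[0])]
    out = []
    for i, f in enumerate(features):
        others = [g for j, g in enumerate(features) if j != i]
        weights = softmax([dot(f, g) for g in others])
        mixed = [sum(w * g[k] for w, g in zip(weights, others))
                 for k in range(len(f))]
        out.append([a + b for a, b in zip(f, mixed)])
    return out


def fuse(features):
    """Hypothetical stand-in for view fusion (an assumption):
    element-wise mean over all available views."""
    n = len(features)
    return [sum(f[k] for f in features) / n
            for k in range(len(features[0]))]


def classify(fused, class_weights):
    """Linear scores per word class followed by softmax."""
    return softmax([dot(fused, w) for w in class_weights])


# Made-up feature vectors for three views of one sign video.
views = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.4], [0.1, 0.9, 0.3]]
# Made-up weights for a two-word vocabulary.
class_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs_multi = classify(fuse(transfer(views)), class_weights)
probs_single = classify(fuse(transfer(views[:1])), class_weights)
```

The same pipeline accepts one view or several, which mirrors the abstract's point that a model trained on multiple views can be tested with either multiple or single views. In this sketch the single-view case simply skips the exchange step.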

