Recognizing American Sign Language Gestures from Within Continuous Videos
Yuancheng Ye, Yingli Tian, Matt Huenerfauth, Jingya Liu
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2018
DOI: 10.1109/CVPRW.2018.00280 (https://doi.org/10.1109/CVPRW.2018.00280)
Citations: 58
Abstract
In this paper, we propose a novel hybrid model, the 3D recurrent convolutional neural network (3DRCNN), to recognize American Sign Language (ASL) gestures and localize their temporal boundaries within continuous videos by fusing multi-modality features. Our proposed 3DRCNN model integrates a 3D convolutional neural network (3DCNN) with an enhanced fully connected recurrent neural network (FC-RNN): the 3DCNN learns multi-modality features from the RGB, motion, and depth channels, while the FC-RNN captures temporal information across short clips divided from the original video. A sliding-window approach segments the entire video sequence into clips, and consecutive clips with the same semantic meaning are then grouped together. To evaluate our method, we collected a new ASL dataset containing two types of videos: Sequence videos (in which a human performs a list of specific ASL words) and Sentence videos (in which a human performs ASL sentences containing multiple ASL words). The dataset is fully annotated for each semantic region (i.e., the time span of each word that the signer performs) and contains multiple input channels. Our proposed method achieves 69.2% accuracy on the Sequence videos for 27 ASL words, which demonstrates its effectiveness in detecting ASL gestures from continuous videos.
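The clip-level pipeline described above (slide a window over the video, classify each clip, then merge consecutive clips that share a predicted word into one semantic region) can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code: the clip length and stride values are hypothetical, and the per-clip labels stand in for the 3DRCNN's predictions.

```python
def segment_clips(num_frames, clip_len=16, stride=8):
    """Divide a video of num_frames into overlapping short clips via a
    sliding window. clip_len/stride are assumed values for illustration."""
    clips = []
    start = 0
    while start + clip_len <= num_frames:
        clips.append((start, start + clip_len))
        start += stride
    return clips


def merge_predictions(clips, labels):
    """Group consecutive clips sharing the same predicted word label into a
    single semantic region (start_frame, end_frame, label), mimicking how
    clips with the same semantic meaning are singled out."""
    regions = []
    for (s, e), lab in zip(clips, labels):
        if regions and regions[-1][2] == lab:
            prev_s, _, _ = regions[-1]
            regions[-1] = (prev_s, e, lab)  # extend the current region
        else:
            regions.append((s, e, lab))     # start a new region
    return regions


# Example: a 48-frame video, with per-clip labels from a classifier.
clips = segment_clips(48)
regions = merge_predictions(clips, ["HELLO", "HELLO", "WORLD", "WORLD", "WORLD"])
```

In a real system, each clip's frames (across the RGB, motion, and depth channels) would be fed to the 3DCNN/FC-RNN to produce the label sequence that `merge_predictions` consumes.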