Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Pub Date: 2017-10-09. DOI: 10.1109/ASRU.2017.8268962

Bowen Shi, Karen Livescu


Citations: 13

Abstract



We address the problem of automatic American Sign Language fingerspelling recognition from video. Prior work has largely relied on frame-level labels, hand-crafted features, or other constraints, and has been hampered by the scarcity of data for this task. We introduce a model for fingerspelling recognition that addresses these issues. The model consists of an auto-encoder-based feature extractor and an attention-based neural encoder-decoder, which are trained jointly. The model receives a sequence of image frames and outputs the fingerspelled word, without relying on any frame-level training labels or hand-crafted features. In addition, the auto-encoder subcomponent makes it possible to leverage unlabeled data to improve the feature learning. The model achieves 11.6% and 4.4% absolute letter accuracy improvement respectively in signer-independent and signer-adapted fingerspelling recognition over previous approaches that required frame-level training labels.
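The joint training of the auto-encoder and the recognizer described above can be pictured as a weighted sum of two losses, where unlabeled clips contribute only the reconstruction term. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the function names, the mean-squared-error reconstruction loss, the per-letter cross-entropy, and the `recon_weight` parameter are all assumptions made for illustration.

```python
import numpy as np


def reconstruction_loss(frames, reconstructed):
    """Mean squared error between input frames and the auto-encoder output (assumed loss)."""
    return float(np.mean((frames - reconstructed) ** 2))


def letter_cross_entropy(logits, targets):
    """Mean cross-entropy over the letters of one fingerspelled word.

    logits:  array of shape (num_letters, vocab_size), decoder scores per output step
    targets: integer array of shape (num_letters,), the true letter indices
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))


def multitask_loss(frames, reconstructed, logits=None, targets=None, recon_weight=1.0):
    """Hypothetical joint objective: recognition loss plus weighted reconstruction loss.

    When targets is None the clip is treated as unlabeled, so only the
    reconstruction term is used.
    """
    loss = recon_weight * reconstruction_loss(frames, reconstructed)
    if targets is not None:
        loss += letter_cross_entropy(logits, targets)
    return loss
```

In this sketch a labeled clip would be scored with `multitask_loss(frames, reconstructed, logits, targets)`, while an unlabeled clip would omit `logits` and `targets`, so it only shapes the feature extractor through the reconstruction term.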

