Connectionist Temporal Fusion for Sign Language Translation

Proceedings of the 26th ACM International Conference on Multimedia. Pub Date: 2018-10-15. DOI: 10.1145/3240508.3240671

Shuo Wang, Dan Guo, Wen-gang Zhou, Zhengjun Zha, M. Wang

Citations: 74

Abstract

Continuous sign language translation (CSLT) is a weakly supervised problem: it aims to translate vision-based videos into natural language under complex sign linguistics, and the ordered words in a sentence label carry no exact boundaries for the corresponding sign actions in the video. This paper proposes a hybrid deep architecture to address CSLT, consisting of a temporal convolution module (TCOV), a bidirectional gated recurrent unit module (BGRU), and a fusion layer module (FL). TCOV captures short-term temporal transitions between adjacent clip features (local pattern), while BGRU models long-term context transitions across the temporal dimension (global pattern). FL concatenates the feature embeddings of TCOV and BGRU to learn their complementary relationship (mutual pattern). On this basis, we propose a joint connectionist temporal fusion (CTF) mechanism that exploits the merits of each module. A joint CTC loss optimization and a decoding fusion strategy based on deep classification scores are designed to boost performance. Trained only once, our model under the CTC constraints achieves performance comparable to existing methods that require multiple EM iterations. Experiments on a benchmark, the RWTH-PHOENIX-Weather dataset, demonstrate the effectiveness of the proposed method.
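The joint CTC loss and score-based decoding fusion described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes that each of the three branches (TCOV, BGRU, FL) emits per-frame log-probabilities over the same vocabulary and the same number of frames, that the joint loss is an unweighted sum of per-branch CTC losses, and that decoding sums the branch scores per frame before a greedy CTC collapse. The abstract does not specify the weighting or the decoder, so these choices, along with all function names and the blank index, are assumptions made for illustration rather than the authors' implementation.

```python
import math

BLANK = 0  # index of the CTC blank symbol (an assumption of this sketch)


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, target):
    """CTC loss for one sequence, computed with the forward recursion.

    log_probs: list over frames; each entry lists per-class log-probabilities.
    target: list of label indices without blanks.
    """
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]  # blank-interleaved target sequence
    size = len(ext)
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    total = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total


def joint_ctc_loss(branch_log_probs, target):
    """Unweighted sum of per-branch CTC losses (the weighting is assumed)."""
    return sum(ctc_neg_log_likelihood(lp, target) for lp in branch_log_probs)


def fused_greedy_decode(branch_log_probs):
    """Sum branch scores per frame, take the argmax, then collapse CTC-style."""
    n_frames = len(branch_log_probs[0])
    decoded, prev = [], None
    for t in range(n_frames):
        frames = [lp[t] for lp in branch_log_probs]
        fused = [sum(scores) for scores in zip(*frames)]
        best = max(range(len(fused)), key=fused.__getitem__)
        if best != BLANK and best != prev:
            decoded.append(best)  # keep only new, non-blank labels
        prev = best
    return decoded
```

Calling `joint_ctc_loss([tcov, bgru, fl], target)` with three equally shaped score sequences gives the training objective of this sketch, and `fused_greedy_decode([tcov, bgru, fl])` returns the fused label sequence. Summing log-probabilities lets a branch that is confidently right outvote one that is marginally wrong on a given frame, which is one simple way to read "utilize the merit of each module".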
