Attentional bias for hands: Cascade dual-decoder transformer for sign language production

IF 1.5 | Region 4: Computer Science | JCR Q4: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IET Computer Vision Pub Date : 2024-03-08 DOI:10.1049/cvi2.12273

Xiaohan Ma, Rize Jin, Jianming Wang, Tae-Sun Chung

IET Computer Vision, vol. 18, no. 5, pp. 696-708. https://onlinelibrary.wiley.com/doi/10.1049/cvi2.12273

Citations: 0

Abstract

Sign Language Production (SLP) is the task of translating textual forms of spoken language into the corresponding sign language expressions. Sign languages convey meaning through multiple asynchronous articulators, spanning manual and non-manual information channels. Recent deep learning-based SLP models generate the full-articulatory sign sequence directly from the text input in an end-to-end manner. However, because of regression to the mean, these models largely down-weight subtle differences in manual articulation. To address these neglected aspects, an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP is proposed. It learns two mappings in succession, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, using an attention-based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more effective guidance, a novel spatio-temporal loss is introduced that penalises shape dissimilarity and temporal distortions in the produced sequences. Experiments on two benchmark sign language datasets from distinct cultures verify the performance of the proposed model. Both quantitative and qualitative results show that the authors’ model is competitive with state-of-the-art models and, in some cases, achieves considerable improvements over them.
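The abstract does not give the form of the spatio-temporal loss, so the following is only an illustrative sketch of how a loss can separate "shape dissimilarity" from "temporal distortion". It assumes classic dynamic time warping (DTW) for the shape term and the warping path's deviation from the diagonal for the temporal term. The function names, the `alpha` weight, and the use of hard (non-differentiable) DTW are all assumptions made for illustration and are not taken from the paper.

```python
import math

def frame_dist(a, b):
    """Squared Euclidean distance between two pose frames (flat coordinate lists)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_path(pred, target):
    """Classic DTW: returns (total alignment cost, optimal warping path)."""
    n, m = len(pred), len(target)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(pred[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path

def spatio_temporal_loss(pred, target, alpha=0.5):
    """Hypothetical combined loss: alpha * shape + (1 - alpha) * temporal.

    shape:    mean DTW alignment cost (small when poses match after warping).
    temporal: mean squared offset of the warping path from the diagonal
              (small when matching poses occur at the same relative time).
    """
    total_cost, path = dtw_path(pred, target)
    shape = total_cost / len(path)
    n, m = len(pred), len(target)
    temporal = sum(
        (i / max(n - 1, 1) - j / max(m - 1, 1)) ** 2 for i, j in path
    ) / len(path)
    return alpha * shape + (1 - alpha) * temporal, shape, temporal
```

In this sketch, a predicted sequence that reaches the correct poses too late has a low shape term but a non-zero temporal term, which is the kind of error a plain per-frame regression loss cannot distinguish from a wrong pose. A trainable model would need a differentiable relaxation of the alignment rather than the hard DTW used here.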



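Source journal

IET Computer Vision (Engineering & Technology: Electrical & Electronic Engineering)

CiteScore

3.30

Self-citation rate

11.80%

Number of publications

Review time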

3.4 months

About the journal: IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest quality research work that is relevant and topical to the field, without forgetting works that aim to introduce new horizons and set the agenda for future avenues of research in computer vision.

IET Computer Vision welcomes submissions on the following topics: Biologically and perceptually motivated approaches to low level vision (feature detection, etc.); Perceptual grouping and organisation; Representation, analysis and matching of 2D and 3D shape; Shape-from-X; Object recognition; Image understanding; Learning with visual inputs; Motion analysis and object tracking; Multiview scene analysis; Cognitive approaches in low, mid and high level vision; Control in visual systems; Colour, reflectance and light; Statistical and probabilistic models; Face and gesture; Surveillance; Biometrics and security; Robotics; Vehicle guidance; Automatic model acquisition; Medical image analysis and understanding; Aerial scene analysis and remote sensing; Deep learning models in computer vision.

Both methodological and application-oriented papers are welcome. Submitted manuscripts are expected to include a detailed and analytical review of the literature, a state-of-the-art exposition of the original proposed research and its methodology, a thorough experimental evaluation, and, last but not least, a comparative evaluation against relevant state-of-the-art methods. Submissions not abiding by these minimum requirements may be returned to authors without being sent to review.

Special Issues: Current Call for Papers

Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf

Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf