Cross-Attention Based Influence Model for Manual and Nonmanual Sign Language Analysis

Authors: Lipisha Chaudhary, Fei Xu, Ifeoma Nwogu
Journal: arXiv - CS - Computer Vision and Pattern Recognition
DOI: arxiv-2409.08162 (https://doi.org/arxiv-2409.08162)
Publication date: 2024-09-12
Abstract
Both manual (relating to the use of hands) and non-manual markers (NMM), such
as facial expressions or mouthing cues, are important for providing the
complete meaning of phrases in American Sign Language (ASL). Efforts have been
made to advance sign-language-to-spoken/written-language understanding, but
most of these have focused on manual features only. In this work,
using advanced neural machine translation methods, we examine and report on the
extent to which facial expressions contribute to understanding sign language
phrases. We present a sign language translation architecture consisting of
two-stream encoders, with one encoder handling the face and the other handling
the upper body (including the hands). We propose a new parallel cross-attention decoding
mechanism that is useful for quantifying the influence of each input modality
on the output. The two streams from the encoder are directed simultaneously to
different attention stacks in the decoder. Examining the properties of the
parallel cross-attention weights allows us to analyze the importance of facial
markers relative to body and hand features during a translation task.
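The parallel cross-attention mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: decoder states attend separately to a face-encoder stream and a body/hands-encoder stream, the two attended outputs are combined, and the two weight matrices are returned so each modality's influence on the output can be examined. The function names (`cross_attention`, `parallel_cross_attention`) and the summation of the two streams are assumptions for the sketch.

```python
# Minimal sketch (assumed, not the paper's code) of parallel cross-attention
# over two encoder streams, exposing per-stream attention weights.
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention; returns outputs and weights."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ values, weights

def parallel_cross_attention(dec_states, face_enc, body_enc):
    """Decoder states attend to both encoder streams in parallel.

    The two weight matrices (w_face, w_body) quantify how much each
    modality contributes when producing each output token; here the
    attended outputs are simply summed (an assumed combination rule).
    """
    out_face, w_face = cross_attention(dec_states, face_enc, face_enc)
    out_body, w_body = cross_attention(dec_states, body_enc, body_enc)
    return out_face + out_body, w_face, w_body
```

Inspecting `w_face` and `w_body` per decoded token (for example, comparing their entropies or peak values) is one simple way to analyze the relative influence of facial markers versus body and hand features.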