Wesley F. Maia , António M. Lopes , Sergio A. David
Title: Automatic sign language to text translation using MediaPipe and transformer architectures
Journal: Neurocomputing, Volume 642, Article 130421 (Q1, Computer Science, Artificial Intelligence)
DOI: 10.1016/j.neucom.2025.130421
Published: 2025-05-13
URL: https://www.sciencedirect.com/science/article/pii/S0925231225010938
Citations: 0
Abstract
This study presents a transformer-based architecture for translating Sign Language into spoken-language text using embeddings of body keypoints, with glosses as an intermediate representation. To the best of our knowledge, this work is the first to successfully leverage body keypoints for Sign Language-to-text translation, achieving performance comparable to baseline models without loss of translation quality. Our approach introduces extensive augmentation techniques for body keypoints and convolutional keypoint embeddings, and integrates Connectionist Temporal Classification (CTC) loss and positional encoding for Sign2Gloss translation. For the Gloss2Text stage, we fine-tune BART, a state-of-the-art transformer model. Evaluation on the Phoenix14T dataset demonstrates that our integrated Sign2Gloss2Text model achieves competitive performance, with BLEU-4 scores showing only marginal differences from baseline models that use pixel embeddings. On the How2Sign dataset, which lacks gloss annotations, direct Sign2Text translation proved challenging, as reflected in lower BLEU-4 scores, highlighting the limitations of gloss-free approaches. This work addresses the narrow domain of the datasets and the unidirectional nature of the translation process while demonstrating the potential of body keypoints for Sign Language Translation. Future work will focus on enhancing the model's ability to capture nuanced and complex contexts, thereby advancing accessibility and assistive technologies for bridging communication between individuals with hearing impairments and the hearing community.
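The abstract mentions extensive augmentation techniques applied to body keypoints before embedding. The paper's exact augmentation pipeline is not specified here, so the sketch below is a minimal, hypothetical illustration of one common family of such augmentations (random in-plane rotation and isotropic scaling of a MediaPipe-style pose sequence); the function name, parameter ranges, and landmark count are assumptions, not the authors' implementation.

```python
import numpy as np

def augment_keypoints(keypoints, max_rotation_deg=10.0, max_scale=0.1, rng=None):
    """Randomly rotate and scale a sequence of 2-D body keypoints.

    keypoints: array of shape (frames, joints, 2) with (x, y) coordinates.
    Illustrative sketch only: ranges and centering choice are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.deg2rad(rng.uniform(-max_rotation_deg, max_rotation_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    # Rotate and scale about the mean joint position so the pose stays centered.
    center = keypoints.mean(axis=(0, 1), keepdims=True)
    return (keypoints - center) @ rot.T * scale + center

# Example: 8 frames of 33 pose landmarks (MediaPipe Pose reports 33), (x, y) only.
seq = np.random.default_rng(0).random((8, 33, 2))
aug = augment_keypoints(seq, rng=np.random.default_rng(1))
print(aug.shape)  # (8, 33, 2)
```

Geometric augmentations like this preserve the temporal structure of the sign while varying the signer's apparent pose, which is why they pair naturally with keypoint embeddings rather than pixel embeddings.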
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing, covering its theory, practice, and applications.