JiHwan Moon, Jihoon Park, Jungeun Kim, Jongseong Bae, Hyeongwoo Jeon, Ha Young Kim
Journal: Pattern Recognition Letters, Volume 196, Pages 117-125
Publication date: 2025-06-09
DOI: 10.1016/j.patrec.2025.06.008
URL: https://www.sciencedirect.com/science/article/pii/S0167865525002363
Impact factor: 3.3; JCR: Q2 (Computer Science, Artificial Intelligence)
DiffSLT: Enhancing diversity in sign language translation via diffusion model
Sign language translation (SLT) is challenging because it converts sign language videos into natural language, crossing modalities. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, which suggests it could similarly benefit SLT. In this work, we propose DiffSLT, a gloss-free SLT framework that leverages a diffusion model to produce diverse translations while preserving sign language semantics. DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of the input video. To enhance visual conditioning, we design a Guidance Fusion Module, which integrates the multi-level spatiotemporal information of the visual features. We also introduce DiffSLT-P, a DiffSLT variant that conditions on both pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap. As a result, DiffSLT and DiffSLT-P significantly improve diversity over prior gloss-free SLT methods and achieve state-of-the-art performance on SLT datasets, markedly improving translation quality.
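The idea of turning random noise into a target latent under visual conditioning can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a standard DDPM-style reverse loop in NumPy, and every name in it (`make_schedule`, `toy_denoiser`, `sample_latent`), the linear noise schedule, and the mean-pooling of visual features are assumptions made for illustration. In particular, `toy_denoiser` is a hand-written stand-in for a trained network and does not model the Guidance Fusion Module.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_denoiser(z_t, t, cond, alpha_bars):
    """Hypothetical stand-in for a learned noise predictor.

    It pretends the clean latent is the mean-pooled visual feature and
    returns the noise that would explain z_t under that guess.
    """
    z0_guess = cond.mean(axis=0)
    a_bar = alpha_bars[t]
    return (z_t - np.sqrt(a_bar) * z0_guess) / np.sqrt(1.0 - a_bar)


def sample_latent(cond, num_steps=50, seed=0):
    """Run the reverse diffusion loop from Gaussian noise, conditioned on cond.

    cond: array of shape (num_frames, dim) holding per-frame visual features.
    Returns a latent vector of shape (dim,).
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = rng.standard_normal(cond.shape[-1])  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(z, t, cond, alpha_bars)
        # Posterior mean of the DDPM reverse step given the predicted noise.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise is injected at every step except the last one.
        noise = rng.standard_normal(z.shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z


if __name__ == "__main__":
    visual_features = np.arange(12, dtype=float).reshape(3, 4)  # 3 frames, 4 dims
    print(sample_latent(visual_features, seed=0))
    print(sample_latent(visual_features, seed=1))
```

Because the toy denoiser always points at the same pooled feature, every seed ends at the same latent here. With a trained denoiser, different starting noise can lead to different latents, which is presumably where the diversity described in the abstract comes from; a separate text decoder, not shown, would map each latent to a sentence.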
About the journal:
Pattern Recognition Letters aims at the rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association for Pattern Recognition, and other developing themes involving learning and recognition.