C2RL: Content and Context Representation Learning for Gloss-Free Sign Language Translation and Retrieval

Impact Factor: 11.1 | CAS Tier 1 (Engineering & Technology) | JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC
Zhigang Chen; Benjia Zhou; Yiqing Huang; Jun Wan; Yibo Hu; Hailin Shi; Yanyan Liang; Zhen Lei; Du Zhang
{"title":"C2RL: Content and Context Representation Learning for Gloss-Free Sign Language Translation and Retrieval","authors":"Zhigang Chen;Benjia Zhou;Yiqing Huang;Jun Wan;Yibo Hu;Hailin Shi;Yanyan Liang;Zhen Lei;Du Zhang","doi":"10.1109/TCSVT.2025.3553052","DOIUrl":null,"url":null,"abstract":"Sign Language Representation Learning (SLRL) is crucial for a range of sign language-related downstream tasks such as Sign Language Translation (SLT) and Sign Language Retrieval (SLRet). Recently, many gloss-based and gloss-free SLRL methods have been proposed, showing promising performance. Among them, the gloss-free approach shows promise for strong scalability without relying on gloss annotations. However, it currently faces suboptimal solutions due to challenges in encoding the intricate, context-sensitive characteristics of sign language videos, mainly struggling to discern essential sign features using a non-monotonic video-text alignment strategy. Therefore, we introduce an innovative pretraining paradigm for gloss-free SLRL, called C<sup>2</sup>RL, in this paper. Specifically, rather than merely incorporating a non-monotonic semantic alignment of video and text to learn language-oriented sign features, we emphasize two pivotal aspects of SLRL: Implicit Content Learning (ICL) and Explicit Context Learning (ECL). ICL delves into the content of communication, capturing the nuances, emphasis, timing, and rhythm of the signs. In contrast, ECL focuses on understanding the contextual meaning of signs and converting them into equivalent sentences. Despite its simplicity, extensive experiments confirm that the joint optimization of ICL and ECL results in robust sign language representation and significant performance gains in gloss-free SLT and SLRet tasks. Notably, C<sup>2</sup>RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign. It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign. Additionally, we set a new baseline for the OpenASL dataset in the SLRet task.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8533-8544"},"PeriodicalIF":11.1000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10933970","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10933970/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Sign Language Representation Learning (SLRL) is crucial for a range of sign language-related downstream tasks such as Sign Language Translation (SLT) and Sign Language Retrieval (SLRet). Recently, many gloss-based and gloss-free SLRL methods have been proposed, showing promising performance. Among them, the gloss-free approach offers strong scalability because it does not rely on gloss annotations. However, it currently yields suboptimal results due to the difficulty of encoding the intricate, context-sensitive characteristics of sign language videos: non-monotonic video-text alignment strategies alone struggle to discern the essential sign features. Therefore, in this paper we introduce an innovative pretraining paradigm for gloss-free SLRL, called C2RL. Specifically, rather than merely applying a non-monotonic semantic alignment of video and text to learn language-oriented sign features, we emphasize two pivotal aspects of SLRL: Implicit Content Learning (ICL) and Explicit Context Learning (ECL). ICL delves into the content of communication, capturing the nuances, emphasis, timing, and rhythm of the signs. In contrast, ECL focuses on understanding the contextual meaning of signs and converting them into equivalent sentences. Despite its simplicity, extensive experiments confirm that jointly optimizing ICL and ECL yields robust sign language representations and significant performance gains on gloss-free SLT and SLRet tasks. Notably, C2RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-Daily, +6.2 on OpenASL, and +1.3 on How2Sign. It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-Daily, and +5.9 on How2Sign. Additionally, we set a new baseline for the OpenASL dataset in the SLRet task.
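The abstract specifies only that ICL and ECL are optimized jointly during pretraining, not how each objective is implemented. As a concrete illustration, below is a minimal PyTorch sketch that assumes ICL is realized as a symmetric contrastive (InfoNCE) video-text alignment loss and ECL as a teacher-forced autoregressive translation loss; the class name C2RLPretrainer, the encoder/decoder interfaces, and all hyperparameters are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal, illustrative sketch of C2RL-style joint pretraining (not the
# authors' code). Assumptions: ICL is implemented as a symmetric InfoNCE
# video-text alignment loss, ECL as teacher-forced autoregressive
# translation. All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class C2RLPretrainer(nn.Module):
    def __init__(self, video_encoder, text_encoder, text_decoder, temperature=0.07):
        super().__init__()
        self.video_encoder = video_encoder  # sign video -> (B, T, D) features
        self.text_encoder = text_encoder    # sentence batch -> (B, D) embeddings
        self.text_decoder = text_decoder    # (video feats, tokens) -> (B, L, V) logits
        self.temperature = temperature

    def icl_loss(self, video_emb, text_emb):
        # Implicit Content Learning: pull each video embedding toward its
        # paired sentence embedding with a symmetric contrastive objective.
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / self.temperature          # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))

    def ecl_loss(self, video_feats, token_ids):
        # Explicit Context Learning: decode the spoken-language sentence from
        # video features with standard teacher-forced cross-entropy.
        logits = self.text_decoder(video_feats, token_ids[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               token_ids[:, 1:].reshape(-1))

    def forward(self, videos, sentences, token_ids, ecl_weight=1.0):
        video_feats = self.video_encoder(videos)     # (B, T, D)
        video_emb = video_feats.mean(dim=1)          # pooled clip-level embedding
        text_emb = self.text_encoder(sentences)      # (B, D)
        # Joint optimization of both objectives, as described in the abstract.
        return self.icl_loss(video_emb, text_emb) + ecl_weight * self.ecl_loss(video_feats, token_ids)
```

Under these assumptions, a single backward pass through the combined loss trains both branches at once; downstream, the contrastive branch would naturally serve SLRet while the decoder branch serves SLT.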
Source Journal

CiteScore: 13.80
Self-citation rate: 27.40%
Annual articles: 660
Review time: 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.