Scaling up Multimodal Pre-Training for Sign Language Understanding.

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6)
Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li
{"title":"扩大手语理解的多模态预训练。","authors":"Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li","doi":"10.1109/TPAMI.2025.3599313","DOIUrl":null,"url":null,"abstract":"<p><p>Sign language pre-training (SLP) has significantly improved the performance of diverse sign language understanding (SLU) tasks. However, many existing methods employ pre-training techniques that are tailored to a specific task with small data scale, resulting in limited model generalization. Some others focus solely on exploring visual cues, neglecting semantically textual cues embedded in sign translation texts. These limitations inherently diminish the representative capacity of pre-trained models. To this end, we present a multimodal SLP framework to leverage rich visual contextual information and vision-language semantic consistency with massively available data to enhance the representative capability of sign language video. Specifically, we first curate a large-scale text-labeled sign pose dataset ($\\sim$ 1.5M), namely SL-1.5M, from various sources to alleviate the scarcity of pre-training data. Subsequently, we propose a pre-training framework, which integrates sign-text contrastive learning with masked pose modeling as the pretext task. In this way, our framework is empowered to effectively capture contextual cues within sign pose sequences and learn visual representation by aligning semantical text-rich features in a latent space. Moreover, in order to grasp the comprehensive meaning of sign language videos, we concurrently model manual and non-manual information to ensure the holistic integrity of visual content. To validate the generalization and superiority of our proposed pre-trained framework, we conduct extensive experiments without intricate design on diverse SLU tasks, achieving new state-of-the-art performance on multiple benchmarks.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Scaling up Multimodal Pre-Training for Sign Language Understanding.\",\"authors\":\"Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li\",\"doi\":\"10.1109/TPAMI.2025.3599313\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Sign language pre-training (SLP) has significantly improved the performance of diverse sign language understanding (SLU) tasks. However, many existing methods employ pre-training techniques that are tailored to a specific task with small data scale, resulting in limited model generalization. Some others focus solely on exploring visual cues, neglecting semantically textual cues embedded in sign translation texts. These limitations inherently diminish the representative capacity of pre-trained models. To this end, we present a multimodal SLP framework to leverage rich visual contextual information and vision-language semantic consistency with massively available data to enhance the representative capability of sign language video. Specifically, we first curate a large-scale text-labeled sign pose dataset ($\\\\sim$ 1.5M), namely SL-1.5M, from various sources to alleviate the scarcity of pre-training data. Subsequently, we propose a pre-training framework, which integrates sign-text contrastive learning with masked pose modeling as the pretext task. 
In this way, our framework is empowered to effectively capture contextual cues within sign pose sequences and learn visual representation by aligning semantical text-rich features in a latent space. Moreover, in order to grasp the comprehensive meaning of sign language videos, we concurrently model manual and non-manual information to ensure the holistic integrity of visual content. To validate the generalization and superiority of our proposed pre-trained framework, we conduct extensive experiments without intricate design on diverse SLU tasks, achieving new state-of-the-art performance on multiple benchmarks.</p>\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2025-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TPAMI.2025.3599313\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2025.3599313","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract


Sign language pre-training (SLP) has significantly improved the performance of diverse sign language understanding (SLU) tasks. However, many existing methods employ pre-training techniques that are tailored to a specific task and trained on small-scale data, resulting in limited model generalization. Others focus solely on visual cues, neglecting the semantic textual cues embedded in sign translation texts. These limitations inherently diminish the representational capacity of pre-trained models. To this end, we present a multimodal SLP framework that leverages rich visual contextual information and vision-language semantic consistency over massively available data to enhance the representational capability of sign language video. Specifically, we first curate a large-scale text-labeled sign pose dataset ($\sim$ 1.5M), namely SL-1.5M, from various sources to alleviate the scarcity of pre-training data. Subsequently, we propose a pre-training framework that integrates sign-text contrastive learning with masked pose modeling as the pretext tasks. In this way, our framework effectively captures contextual cues within sign pose sequences and learns visual representations by aligning them with semantically rich textual features in a latent space. Moreover, in order to grasp the comprehensive meaning of sign language videos, we concurrently model manual and non-manual information to ensure the holistic integrity of visual content. To validate the generalization and superiority of our proposed pre-training framework, we conduct extensive experiments, without intricate task-specific design, on diverse SLU tasks, achieving new state-of-the-art performance on multiple benchmarks.
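The abstract names two pretext objectives (sign-text contrastive learning and masked pose modeling) but gives no implementation details. Purely as an illustration, the PyTorch sketch below shows one generic way such a combined objective could be written; the encoder architecture, pose dimensionality, mask ratio, and the assumption of a pre-computed text embedding are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: a generic sign-text contrastive + masked pose
# modeling objective. All names, shapes, and hyperparameters are assumptions,
# not the authors' released design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignTextPretrainSketch(nn.Module):
    def __init__(self, pose_dim=274, hidden_dim=512, mask_ratio=0.5, temperature=0.07):
        super().__init__()
        # Hypothetical pose encoder: a small Transformer over keypoint frames.
        self.pose_proj = nn.Linear(pose_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.pose_encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pose_decoder = nn.Linear(hidden_dim, pose_dim)  # reconstructs masked frames
        self.text_head = nn.Linear(hidden_dim, hidden_dim)   # projects pooled pose features
        self.mask_token = nn.Parameter(torch.zeros(hidden_dim))
        self.mask_ratio = mask_ratio
        self.temperature = temperature

    def forward(self, poses, text_emb):
        # poses: (B, T, pose_dim) keypoint sequences covering manual and
        # non-manual cues; text_emb: (B, hidden_dim) embeddings from any
        # frozen text encoder (an assumption, not specified by the abstract).
        x = self.pose_proj(poses)

        # Randomly mask a fraction of frames and replace them with a mask token.
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        h = self.pose_encoder(x)

        # Masked pose modeling: reconstruct only the masked frames.
        recon = self.pose_decoder(h)
        mpm_loss = F.mse_loss(recon[mask], poses[mask])

        # Sign-text contrastive loss: align pooled pose features with text
        # embeddings in a shared latent space (symmetric InfoNCE).
        v = F.normalize(self.text_head(h.mean(dim=1)), dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.t() / self.temperature
        labels = torch.arange(v.size(0), device=v.device)
        con_loss = (F.cross_entropy(logits, labels) +
                    F.cross_entropy(logits.t(), labels)) / 2

        return con_loss + mpm_loss
```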
