Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang
{"title":"Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm","authors":"Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang","doi":"arxiv-2408.10488","DOIUrl":null,"url":null,"abstract":"Sign Language Translation (SLT) is a core task in the field of AI-assisted\ndisability. Unlike traditional SLT based on visible light videos, which is\neasily affected by factors such as lighting, rapid hand movements, and privacy\nbreaches, this paper proposes the use of high-definition Event streams for SLT,\neffectively mitigating the aforementioned issues. This is primarily because\nEvent streams have a high dynamic range and dense temporal signals, which can\nwithstand low illumination and motion blur well. Additionally, due to their\nsparsity in space, they effectively protect the privacy of the target person.\nMore specifically, we propose a new high-resolution Event stream sign language\ndataset, termed Event-CSL, which effectively fills the data gap in this area of\nresearch. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in\nthe text vocabulary. These samples are collected in a variety of indoor and\noutdoor scenes, encompassing multiple angles, light intensities, and camera\nmovements. We have benchmarked existing mainstream SLT works to enable fair\ncomparison for future efforts. Based on this dataset and several other\nlarge-scale datasets, we propose a novel baseline method that fully leverages\nthe Mamba model's ability to integrate temporal information of CNN features,\nresulting in improved sign language translation outcomes. Both the benchmark\ndataset and source code will be released on\nhttps://github.com/Event-AHU/OpenESL","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Neural and Evolutionary Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.10488","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Sign Language Translation (SLT) is a core task in AI-assisted accessibility. Unlike traditional SLT based on visible-light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating these issues. This is primarily because Event streams have a high dynamic range and dense temporal signals, which withstand low illumination and motion blur well. Additionally, due to their spatial sparsity, they effectively protect the privacy of the target person. More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which effectively fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected in a variety of indoor and outdoor scenes, encompassing multiple viewing angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT works to enable fair comparison for future efforts. Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model's ability to integrate the temporal information of CNN features, resulting in improved sign language translation outcomes. Both the benchmark dataset and source code will be released at https://github.com/Event-AHU/OpenESL.
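The baseline described in the abstract (per-frame CNN features, temporally integrated by a Mamba block, then decoded into text) can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' released implementation: it assumes event streams have already been binned into two-channel (negative/positive polarity) event frames, uses a torchvision ResNet-18 as the CNN backbone, the `mamba_ssm` package's `Mamba` block for temporal mixing, and a small Transformer decoder for text generation. All module names, hyperparameters, and tensor shapes are placeholders.

```python
# Hypothetical sketch of a CNN + Mamba SLT baseline (not the authors' code).
# Assumptions: event streams are pre-binned into (T, 2, H, W) polarity-count
# frames; the `mamba_ssm` Mamba block maps (batch, length, dim) -> (batch, length, dim).
import torch
import torch.nn as nn
from torchvision.models import resnet18
from mamba_ssm import Mamba  # assumed dependency; its fused kernels require CUDA


class EventSLTBaseline(nn.Module):
    """Per-frame CNN features -> Mamba temporal integration -> text decoder."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_mamba: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)
        # Event frames carry 2 polarity channels instead of RGB.
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()  # keep the 512-d pooled feature per frame
        self.backbone = backbone
        self.proj = nn.Linear(512, d_model)
        self.temporal = nn.ModuleList([Mamba(d_model=d_model) for _ in range(n_mamba)])
        dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, event_frames: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # event_frames: (B, T, 2, H, W) binned event frames; text_tokens: (B, L) token ids
        b, t = event_frames.shape[:2]
        feats = self.backbone(event_frames.flatten(0, 1))    # (B*T, 512)
        feats = self.proj(feats).view(b, t, -1)              # (B, T, D)
        for block in self.temporal:                          # Mamba mixes along the time axis
            feats = feats + block(feats)                     # residual state-space update
        tgt = self.tok_embed(text_tokens)                    # (B, L, D)
        causal = torch.triu(                                 # causal mask for the decoder
            torch.full((tgt.size(1), tgt.size(1)), float("-inf"), device=tgt.device),
            diagonal=1,
        )
        out = self.decoder(tgt, feats, tgt_mask=causal)      # cross-attend to video features
        return self.lm_head(out)                             # (B, L, vocab_size) logits


if __name__ == "__main__":
    device = "cuda"  # mamba_ssm's selective-scan kernels require a CUDA device
    model = EventSLTBaseline(vocab_size=2544).to(device)
    frames = torch.randn(2, 16, 2, 224, 224, device=device)  # toy batch: 16 event frames
    tokens = torch.randint(0, 2544, (2, 12), device=device)  # toy target token ids
    print(model(frames, tokens).shape)                       # torch.Size([2, 12, 2544])
```

The actual method may differ in how events are represented, how many Mamba layers are stacked, and how text is decoded; refer to the released code at https://github.com/Event-AHU/OpenESL for the authors' implementation.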