EvCSLR: Event-Guided Continuous Sign Language Recognition and Benchmark

Impact factor: 8.4 · Zone 1 (Computer Science) · JCR Q1 (Computer Science, Information Systems)

IEEE Transactions on Multimedia · Pub Date: 2024-12-24 · DOI: 10.1109/TMM.2024.3521750

Yu Jiang; Yuehang Wang; Siqi Li; Yongji Zhang; Qianren Guo; Qi Chu; Yue Gao

Volume 27, pages 1349-1361. Source: https://ieeexplore.ieee.org/document/10814091/

Citations: 0

Abstract

Classical continuous sign language recognition (CSLR) faces two main challenges in real-world scenarios: traditional RGB cameras may fail to capture accurate inter-frame movement trajectories because of motion blur, and valid information may be insufficient under low illumination. In this paper, we leverage an event camera for the first time to overcome these challenges. Event cameras are bio-inspired vision sensors that can efficiently record high-speed sign language movements under low illumination and capture human information while eliminating redundant background interference. To fully exploit the benefits of the event camera for CSLR, we propose a novel event-guided multi-modal CSLR framework that achieves strong performance under complex scenarios. Specifically, a time redundancy correction (TRCorr) module rectifies redundant information in the temporal sequences, directing the model to focus on distinctive features, and a multi-modal cross-attention interaction (MCAI) module facilitates information fusion between the event and frame domains. Furthermore, we construct the first event-based CSLR dataset, named EvCSLR, which will be released as the first event-based CSLR benchmark. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on the EvCSLR and PHOENIX-2014 T datasets.
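The abstract names cross-attention as the mechanism for fusing event and frame features but gives no architectural details. The sketch below is only a generic illustration of scaled dot-product cross-attention between two feature sequences, not the paper's MCAI module. The function names (`cross_attention`, `fuse`), the residual addition, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    queries: list of vectors from one modality (each of length d)
    keys, values: lists of vectors from the other modality;
    keys must also have length d, and len(keys) == len(values).
    Returns (outputs, weights), one output vector per query.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key of the other modality.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def fuse(frame_feats, event_feats):
    """Bidirectional fusion with a residual add (an illustrative assumption).

    Frame features attend to event features and vice versa; each stream
    keeps its own length and gains context from the other modality.
    """
    from_events, _ = cross_attention(frame_feats, event_feats, event_feats)
    from_frames, _ = cross_attention(event_feats, frame_feats, frame_feats)
    fused_frames = [[a + b for a, b in zip(f, g)]
                    for f, g in zip(frame_feats, from_events)]
    fused_events = [[a + b for a, b in zip(e, g)]
                    for e, g in zip(event_feats, from_frames)]
    return fused_frames, fused_events


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy frame features
    events = [[0.5, 0.5], [1.0, -1.0]]             # toy event features
    fused_frames, fused_events = fuse(frames, events)
    print(len(fused_frames), len(fused_events))
```

In a trained model the queries, keys, and values would normally pass through learned linear projections first; they are omitted here to keep the sketch dependency-free.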

Source journal: IEEE Transactions on Multimedia (Engineering and Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.