CSLNSpeech: Solving the extended speech separation problem with the help of Chinese sign language

Impact Factor: 2.4 · CAS Tier 3 (Computer Science) · JCR Q2 (Acoustics)
Jiasong Wu, Xuan Li, Taotao Li, Fanman Meng, Youyong Kong, Guanyu Yang, Lotfi Senhadji, Huazhong Shu
DOI: 10.1016/j.specom.2024.103131
Journal: Speech Communication, Volume 165, Article 103131
Published: 2024-09-02 (Journal Article)
Citations: 0

Abstract


Previous audio-visual speech separation methods synchronize the speaker's facial movements with the speech in the video to self-supervise speech separation. In this paper, we propose a model that solves the speech separation problem with the assistance of both face and sign language, which we call the extended speech separation problem. We design a general deep learning network that learns to combine three modalities—audio, face, and sign language information—to solve the speech separation problem better. We introduce a large-scale dataset, the Chinese Sign Language News Speech (CSLNSpeech) dataset, to train the model; in it, three modalities coexist: audio, face, and sign language. Experimental results show that the proposed model performs better and is more robust than conventional audio-visual systems. In addition, the sign language modality can also be used alone to supervise speech separation tasks, and introducing sign language helps hearing-impaired people learn and communicate. Finally, our model is a general speech separation framework and achieves very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech
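The multimodal fusion idea the abstract describes can be illustrated with a minimal sketch. This is not the authors' architecture: the frame-wise concatenation, single linear layer, and all shapes below are illustrative assumptions, standing in for the learned audio-face-sign fusion that predicts a time-frequency mask for separation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (assumptions, not from the paper):
T, F = 10, 257                               # time frames, frequency bins
audio_feat = rng.standard_normal((T, F))     # mixture log-spectrogram features
face_feat = rng.standard_normal((T, 64))     # per-frame face embedding
sign_feat = rng.standard_normal((T, 64))     # per-frame sign-language embedding

# Fuse the three modalities by frame-wise concatenation, a generic
# stand-in for the learned fusion network described in the abstract.
fused = np.concatenate([audio_feat, face_feat, sign_feat], axis=1)  # (T, F + 128)

# A single linear layer plus sigmoid predicts a soft mask in [0, 1];
# a trained model would learn W from data.
W = rng.standard_normal((fused.shape[1], F)) * 0.01
mask = 1.0 / (1.0 + np.exp(-(fused @ W)))    # (T, F)

# Apply the mask to the mixture magnitude spectrogram to estimate
# the target speaker's spectrogram.
mixture_mag = np.abs(rng.standard_normal((T, F)))
separated = mask * mixture_mag               # (T, F)

print(separated.shape)                       # (10, 257)
```

In a real system each modality stream would pass through its own encoder (e.g. convolutional or recurrent) before fusion, and the mask would be applied to the complex STFT of the mixture before inverse-transforming back to a waveform.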

Source journal
Speech Communication (Engineering/Technology – Computer Science: Interdisciplinary Applications)
CiteScore: 6.80
Self-citation rate: 6.20%
Annual article count: 94
Average review time: 19.2 weeks
Journal scope: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.