Audio-guided self-supervised learning for disentangled visual speech representations

IF 3.4 · Tier 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen
Frontiers of Computer Science · Journal Article · Published 2024-06-25 · DOI: 10.1007/s11704-024-3787-8 (https://doi.org/10.1007/s11704-024-3787-8)
Citations: 0

Abstract


In this paper, we propose a novel two-branch framework for learning disentangled visual speech representations, motivated by two particular observations. The main idea is to introduce the audio signal to guide the learning of speech-relevant cues, and to introduce a bottleneck that restricts the speech-irrelevant branch from learning high-frequency, fine-grained speech cues. Experiments on the word-level and sentence-level audio-visual speech datasets LRW and LRS2-BBC demonstrate the effectiveness of the framework. In future work, we plan to explore more explicit auxiliary tasks and constraints beyond the reconstruction task of the speech-relevant and speech-irrelevant branches, to further improve the framework's ability to capture speech cues in video. Combining multiple types of knowledge representations [10] to further improve the learned speech representations is also worth trying and is likewise left for future work.
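
The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the form of the bottleneck or the audio-guidance objective, so the temporal average pooling and the mean-squared-error loss below, along with the function names `temporal_bottleneck` and `audio_guidance_loss`, are assumptions chosen only to make the idea concrete.

```python
from statistics import fmean


def temporal_bottleneck(frames):
    """Hypothetical bottleneck: average per-frame feature vectors over time.

    Collapsing the time axis leaves one clip-level vector, so frame-to-frame
    (high-frequency) variation cannot pass through this branch.
    """
    dims = len(frames[0])
    return [fmean(frame[d] for frame in frames) for d in range(dims)]


def audio_guidance_loss(visual_feats, audio_feats):
    """Hypothetical guidance term: mean squared error between per-frame
    visual features and time-aligned audio features."""
    total = 0.0
    count = 0
    for v, a in zip(visual_feats, audio_feats):
        for x, y in zip(v, a):
            total += (x - y) ** 2
            count += 1
    return total / count


# Toy per-frame features: 2 frames, 2 dimensions each (made-up values).
video_frames = [[0.0, 2.0], [2.0, 4.0]]
audio_frames = [[1.0, 2.0], [2.0, 3.0]]

clip_vector = temporal_bottleneck(video_frames)          # speech-irrelevant branch
guidance = audio_guidance_loss(video_frames, audio_frames)  # speech-relevant branch
```

In this sketch the speech-irrelevant branch sees only the pooled `clip_vector`, which can hold slowly varying information but not fine-grained per-frame detail, while the guidance term pulls the per-frame features of the other branch toward the audio.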

Source journal
Frontiers of Computer Science
Categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING
CiteScore: 8.60
Self-citation rate: 2.40%
Articles published: 799
Review time: 6-12 weeks
About the journal: Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles on a wide range of topics, including architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, and interdisciplinary areas. The journal especially encourages papers from newly emerging and multidisciplinary areas, papers reflecting international trends in research and development, and papers on special topics reporting progress made by Chinese computer scientists.