CA-Wav2Lip: Coordinate Attention-based Speech To Lip Synthesis In The Wild

Kuan-Chien Wang, J. Zhang, Jingquan Huang, Qi Li, Minmin Sun, Kazuya Sakai, Wei-Shinn Ku
Published in: 2023 IEEE International Conference on Smart Computing (SMARTCOMP)
Publication date: 2023-06-01
DOI: 10.1109/SMARTCOMP58114.2023.00018 (https://doi.org/10.1109/SMARTCOMP58114.2023.00018)
Citation count: 0

Abstract

With the growing consumption of online visual content, there is an urgent need for video translation to reach a wider audience around the world. However, directly translated and dubbed material fails to deliver a natural audio-visual experience because the translated speech and the lip movements are often out of sync. Improving the viewing experience therefore requires an accurate system for automatically generating synchronized lip movements. To improve both the accuracy and the visual quality of speech-to-lip generation, this research proposes two techniques: embedding attention mechanisms in the convolution layers, and deploying SSIM as a loss function in the visual quality discriminator. The proposed system and several other systems are tested on three audiovisual datasets. The results show that the proposed methods outperform state-of-the-art speech-to-lip synthesis in both the accuracy and the visual quality of the generated audio-lip synchronization.
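
The abstract names SSIM (structural similarity) as a loss term but does not give its formulation. As a rough illustration only, the sketch below computes the standard SSIM index from global image statistics and turns it into a loss as 1 - SSIM. It is not the authors' implementation: the function names `ssim_global` and `ssim_loss`, the flat-list grayscale input, the [0, 1] intensity range, and the use of a single global window instead of the usual sliding window are all assumptions made for brevity.

```python
def ssim_global(x, y, data_range=1.0):
    """SSIM index of two equal-length flat lists of pixel intensities.

    Assumption: statistics are taken over the whole image (one window),
    whereas practical SSIM implementations average over local windows.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    # Standard stabilising constants, with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(generated, reference, data_range=1.0):
    """Hypothetical loss: 0 for identical images, larger as structure diverges."""
    return 1.0 - ssim_global(generated, reference, data_range)


# Usage: a generated frame compared with itself and with an inverted frame.
frame = [0.0, 0.5, 1.0, 0.5]
inverted = [1.0, 0.5, 0.0, 0.5]
loss_same = ssim_loss(frame, frame)
loss_inverted = ssim_loss(frame, inverted)
```

Because SSIM lies in [-1, 1], this loss ranges from 0 (identical structure) to 2 (inverted structure). How the paper weights such a term inside the visual quality discriminator is not stated in the abstract.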