VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

IF 2.6 | Zone 3, Engineering Technology | JCR Q2 | COMPUTER SCIENCE, INFORMATION SYSTEMS

Electronics | Pub Date: 2024-09-14 | DOI: 10.3390/electronics13183657

Li Liu, Jinhui Wang, Shijuan Chen, Zongmei Li


Citations: 0

Abstract


Speech-driven lip synchronization is a crucial technology for generating realistic facial animations, with broad application prospects in virtual reality, education, training, and other fields. However, existing methods still face challenges in generating high-fidelity facial animations, particularly in addressing lip jitter and facial motion instability issues in continuous frame sequences. This study presents VividWav2Lip, an improved speech-driven lip synchronization model. Our model incorporates three key innovations: a cross-attention mechanism for enhanced audio-visual feature fusion, an optimized network structure with Squeeze-and-Excitation (SE) residual blocks, and the integration of the CodeFormer facial restoration network for post-processing. Extensive experiments were conducted on a diverse dataset comprising multiple languages and facial types. Quantitative evaluations demonstrate that VividWav2Lip outperforms the baseline Wav2Lip model by 5% in lip sync accuracy and image generation quality, with even more significant improvements over other mainstream methods. In subjective assessments, 85% of participants perceived VividWav2Lip-generated animations as more realistic compared to those produced by existing techniques. Additional experiments reveal our model’s robust cross-lingual performance, maintaining consistent quality even for languages not included in the training set. This study not only advances the theoretical foundations of audio-driven lip synchronization but also offers a practical solution for high-fidelity, multilingual dynamic face generation, with potential applications spanning virtual assistants, video dubbing, and personalized content creation.
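Two of the components named in the abstract, cross-attention fusion and Squeeze-and-Excitation (SE) residual blocks, can be illustrated with a small sketch. The abstract does not give architectural details, so everything below is an assumption made for illustration: visual tokens are used as queries over audio tokens, the learned query/key/value projections are omitted, the convolutional branch of the residual block is replaced by a placeholder `branch` function, and all names (`cross_attention`, `se_gate`, `se_residual`, `w1`, `w2`) are hypothetical rather than taken from the paper. CodeFormer post-processing is not sketched, since it depends on a pretrained restoration network.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, audio):
    """Scaled dot-product cross-attention.

    Assumption: each visual token is a query; audio tokens act as both
    keys and values. Learned projections are omitted for brevity.
    """
    d = len(audio[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in visual:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in audio]
        weights = softmax(scores)
        # Each output token is a weighted average of the audio tokens.
        fused.append([sum(w * v[j] for w, v in zip(weights, audio))
                      for j in range(d)])
    return fused


def se_gate(channels, w1, w2):
    """Squeeze-and-Excitation gating over a list of channels.

    channels: one list of activations per channel.
    w1: bottleneck weights (hidden x channels); w2: (channels x hidden).
    """
    # Squeeze: global average of each channel.
    z = [sum(c) / len(c) for c in channels]
    # Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1).
    h = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h))))
         for row in w2]
    # Scale: multiply every activation by its channel's gate.
    return [[g * x for x in c] for g, c in zip(s, channels)]


def se_residual(channels, w1, w2, branch=lambda c: c):
    """Residual block: input + SE-gated branch output.

    Assumption: `branch` stands in for the convolutional layers; the
    default identity keeps the sketch self-contained.
    """
    gated = se_gate(branch(channels), w1, w2)
    return [[x + y for x, y in zip(c, gc)] for c, gc in zip(channels, gated)]


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(cross_attention(visual, audio))
    print(se_residual([[1.0, 3.0], [2.0, 2.0]], [[1.0, 1.0]], [[0.0], [1.0]]))
```

In this sketch the fused output has one row per visual token, each a convex combination of audio tokens, while the SE gate rescales whole channels by a single learned weight. A practical model would apply both to convolutional feature maps in a deep learning framework rather than to plain lists.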


Source journal

Electronics (Computer Science, Computer Networks and Communications)

CiteScore: 1.10
Self-citation rate: 10.30%
Articles published: 3515
Review time: 16.71 days

Journal description: Electronics (ISSN 2079-9292; CODEN: ELECGJ) is an international, open access journal on the science of electronics and its applications, published quarterly online by MDPI.