时空优化的音频驱动人脸生成技术

IF 2.7 4区计算机科学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Computer Graphics Forum Pub Date : 2024-10-24 DOI:10.1111/cgf.15228

Biao Dong, Bo-Yao Ma, Lei Zhang

{"title":"时空优化的音频驱动人脸生成技术","authors":"Biao Dong, Bo-Yao Ma, Lei Zhang","doi":"10.1111/cgf.15228","DOIUrl":null,"url":null,"abstract":"<p>Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which affects lip sync accuracy. And the loss of facial details during image reconstruction often results in visual artifacts in the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new spatio-temporal consistency. Specifically, the temporal consistency is achieved through consecutive frames of the each phoneme, which form temporal modules that exhibit similar lip appearance changes. This allows for adaptive adjustment in the lip movement for accurate sync. The spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method can generate more natural talking faces than previous state-of-the-art methods in both accurate lip sync and realistic facial details.</p>","PeriodicalId":10687,"journal":{"name":"Computer Graphics Forum","volume":"43 7","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Spatially and Temporally Optimized Audio-Driven Talking Face Generation\",\"authors\":\"Biao Dong, Bo-Yao Ma, Lei Zhang\",\"doi\":\"10.1111/cgf.15228\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which affects lip sync accuracy. And the loss of facial details during image reconstruction often results in visual artifacts in the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new spatio-temporal consistency. Specifically, the temporal consistency is achieved through consecutive frames of the each phoneme, which form temporal modules that exhibit similar lip appearance changes. This allows for adaptive adjustment in the lip movement for accurate sync. The spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method can generate more natural talking faces than previous state-of-the-art methods in both accurate lip sync and realistic facial details.</p>\",\"PeriodicalId\":10687,\"journal\":{\"name\":\"Computer Graphics Forum\",\"volume\":\"43 7\",\"pages\":\"\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2024-10-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Graphics Forum\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/cgf.15228\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Graphics Forum","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/cgf.15228","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

音频驱动的人脸识别本质上是从音频到视频帧的跨模态映射。主要挑战在于复杂的一对多映射，这会影响唇语同步的准确性。而图像重建过程中面部细节的丢失往往会导致生成的视频出现视觉假象。为了克服这些挑战，本文提出用一种新的时空一致性来提高生成的会说话人脸的质量。具体来说，时空一致性是通过每个音素的连续帧来实现的，这些连续帧形成的时空模块表现出相似的唇部外观变化。这样就可以自适应地调整嘴唇的运动，实现准确的同步。空间一致性与局部区域内纹理的均匀分布有关，这些纹理形成空间模块，并调节生成器中的纹理分布。这就产生了重建面部图像中的精细细节。广泛的实验表明，我们的方法在准确的唇部同步和逼真的面部细节方面都能生成比以往最先进的方法更自然的说话人脸。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Spatially and Temporally Optimized Audio-Driven Talking Face Generation

Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which affects lip sync accuracy. And the loss of facial details during image reconstruction often results in visual artifacts in the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new spatio-temporal consistency. Specifically, the temporal consistency is achieved through consecutive frames of the each phoneme, which form temporal modules that exhibit similar lip appearance changes. This allows for adaptive adjustment in the lip movement for accurate sync. The spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method can generate more natural talking faces than previous state-of-the-art methods in both accurate lip sync and realistic facial details.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer Graphics Forum 工程技术-计算机：软件工程

CiteScore

5.80

自引率

12.00%

发文量

175

审稿时长

3-6 weeks

期刊介绍： Computer Graphics Forum is the official journal of Eurographics, published in cooperation with Wiley-Blackwell, and is a unique, international source of information for computer graphics professionals interested in graphics developments worldwide. It is now one of the leading journals for researchers, developers and users of computer graphics in both commercial and academic environments. The journal reports on the latest developments in the field throughout the world and covers all aspects of the theory, practice and application of computer graphics.