Talking Face Generation With Lip and Identity Priors

IF 1.7 · Zone 4 Computer Science · JCR Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Computer Animation and Virtual Worlds Pub Date : 2025-05-28 DOI:10.1002/cav.70026

Jiajie Wu, Frederick W. B. Li, Gary K. L. Tam, Bailin Yang, Fangzhe Nan, Jiahao Pan

Volume 36, Issue 3 · Journal Article · https://onlinelibrary.wiley.com/doi/10.1002/cav.70026

Citations: 0

Abstract


Speech-driven talking face video generation has attracted growing interest in recent research. While person-specific approaches yield high-fidelity results, they require extensive training data from each individual speaker. In contrast, general-purpose methods often struggle with accurate lip synchronization, identity preservation, and natural facial movements. To address these limitations, we propose a novel architecture that combines an alignment model with a rendering model. The rendering model synthesizes identity-consistent lip movements by leveraging facial landmarks derived from speech, a partially occluded target face, multi-reference lip features, and the input audio. Concurrently, the alignment model estimates optical flow using the occluded face and a static reference image, enabling precise alignment of facial poses and lip shapes. This collaborative design enhances the rendering process, resulting in more realistic and identity-preserving outputs. Extensive experiments demonstrate that our method significantly improves lip synchronization and identity retention, establishing a new benchmark in talking face video generation.
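The abstract says only that the alignment model estimates optical flow from the occluded face and a static reference image; it does not say how that flow is applied. As a rough illustration of what a dense flow field is commonly used for, the sketch below backward-warps a reference image with bilinear sampling. The function name `warp_with_flow`, the `(dx, dy)` flow convention, and the border clamping are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def warp_with_flow(reference: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp `reference` (H, W, C) with a dense flow field (H, W, 2).

    Assumed convention (not from the paper): flow[y, x] = (dx, dy) means that
    output pixel (x, y) is sampled from reference location (x + dx, y + dy).
    Sampling is bilinear; coordinates are clamped to the image border.
    """
    h, w = reference.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Source coordinates for every output pixel, clamped to valid range.
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (src_x - x0)[..., None]
    wy = (src_y - y0)[..., None]

    # Bilinear interpolation: blend along x, then along y.
    top = (1 - wx) * reference[y0, x0] + wx * reference[y0, x1]
    bottom = (1 - wx) * reference[y1, x0] + wx * reference[y1, x1]
    return (1 - wy) * top + wy * bottom
```

With an all-zero flow the function returns the reference unchanged; a constant flow of `(1, 0)` shifts the content one pixel to the left and repeats the rightmost column at the border. In a learned pipeline the flow would come from a network rather than being set by hand.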


Source journal

Computer Animation and Virtual Worlds · Engineering & Technology, Computer Science: Software Engineering

CiteScore

2.20

Self-citation rate

0.00%

Articles published

Review time

6-12 weeks

About the journal: With the advent of very powerful PCs and high-end graphics cards, there has been an incredible development in Virtual Worlds, real-time computer animation and simulation, and games. At the same time, new and cheaper Virtual Reality devices have appeared, allowing interaction with these real-time Virtual Worlds and even with real worlds through Augmented Reality. Three-dimensional characters, especially Virtual Humans, are now of exceptional quality, which allows them to be used in the movie industry. But this is only a beginning: with the development of Artificial Intelligence and Agent technology, these characters will become more and more autonomous and even intelligent. They will inhabit the Virtual Worlds in a Virtual Life together with animals and plants.