StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

Authors: Jun Ling; Xu Tan; Liyang Chen; Runnan Li; Yuchao Zhang; Sheng Zhao; Li Song
DOI: 10.1109/JSTSP.2023.3333552
Journal: IEEE Journal of Selected Topics in Signal Processing, vol. 17, no. 6, pp. 1232-1247
Published: 2023-11-16 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10319685/
Abstract: While previous methods for speech-driven talking face generation have shown significant advances in improving the visual and lip-sync quality of the synthesized videos, they have paid less attention to lip motion jitters, which can substantially undermine the perceived quality of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this article, we conduct systematic analyses of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and the output video, and we implement several effective designs to improve motion stability. This study finds that several factors can lead to jitters in the synthesized talking face video: jitters from the input face representations, training-inference mismatch, and a lack of dependency modeling in the generation network. Accordingly, we propose three effective solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions added to the input data of the neural renderer during training to simulate inference distortion and reduce the mismatch; 3) an audio-fused transformer generator to model inter-frame dependency. In addition, since there is no off-the-shelf metric that can measure the motion jitters of talking face video, we devise an objective metric (Motion Stability Index, MSI) to quantify them. Extensive experimental results show the superiority of the proposed method for motion-stable talking video generation, with quality superior to that of previous systems.
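The abstract's two most concrete ideas, temporal smoothing of the 3D face representations and the MSI metric, can be illustrated with a short sketch. The paper's adaptive sigma-selection rule and its exact MSI formula are not stated in the abstract, so the fixed kernel width and the landmark-acceleration jitter proxy below are assumptions, and the function names `gaussian_smooth_sequence` and `motion_stability_index` are hypothetical.

```python
import numpy as np

def gaussian_smooth_sequence(coeffs: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Temporally smooth per-frame 3D face coefficients of shape (T, D).

    A fixed-sigma stand-in for the paper's *adaptive* smoothing module,
    whose sigma-selection rule is not specified in the abstract.
    """
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    # Edge-replicate padding so the output keeps the original length T.
    padded = np.pad(coeffs, ((radius, radius), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(coeffs.shape[1])],
        axis=1,
    )

def motion_stability_index(landmarks: np.ndarray, eps: float = 1e-8) -> float:
    """Hypothetical MSI proxy: inverse of mean landmark acceleration magnitude.

    landmarks: (T, K, 2) tracked 2D facial landmarks over T frames.
    Second-order temporal differences capture frame-to-frame jitter;
    a more stable video yields smaller accelerations and a larger score.
    """
    velocity = np.diff(landmarks, axis=0)        # (T-1, K, 2)
    acceleration = np.diff(velocity, axis=0)     # (T-2, K, 2)
    jitter = np.linalg.norm(acceleration, axis=-1).mean()
    return 1.0 / (jitter + eps)
```

Under this reading, smoothing the input representations before rendering should raise the score of the generated video, which is how such a metric would separate motion-stable systems from jittery ones.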
Journal Introduction:
The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others.
The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.