Audio-Driven Talking Face Generation With Segmented Static Facial References for Customized Health Device Interactions

IF 10.9 · CAS Tier 2 (Computer Science) · JCR Q1 (Engineering, Electrical & Electronic)
Zige Wang;Yashuai Wang;Tianyu Liu;Peng Zhang;Lei Xie;Yangming Guo
{"title":"Audio-Driven Talking Face Generation With Segmented Static Facial References for Customized Health Device Interactions","authors":"Zige Wang;Yashuai Wang;Tianyu Liu;Peng Zhang;Lei Xie;Yangming Guo","doi":"10.1109/TCE.2025.3565518","DOIUrl":null,"url":null,"abstract":"In a variety of human-machine interaction (HMI) applications, the high-level techniques based on audio-driven talking face generation are often challenged by the issues of temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, the temporal disturbances introduced during the inference phase continue to limit the enhancement of generative performance. Inspired by the intrinsic connection between the segmented static facial image and the stable appearance representation, in this study, two strategies, Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR), are proposed to improve performance during the inference stage. The corresponding functionality consists of: MTS involves segmenting the input video into several clips, effectively reducing the complexity of the inference process, and SFR utilizes static facial references to mitigate the temporal noise generated by dynamic sequences, thereby enhancing the quality of the generated outputs. Substantial experiments on the LRS2 and VoxCeleb2 datasets have demonstrated that the proposed strategies are able to significantly enhance inference performance with the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. For effectiveness validation in realistic scenario applications, a deployment has also been conducted on the healthcare devices with the proposed solution.","PeriodicalId":13208,"journal":{"name":"IEEE Transactions on Consumer Electronics","volume":"71 2","pages":"5404-5413"},"PeriodicalIF":10.9000,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Consumer Electronics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10980001/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

In a variety of human-machine interaction (HMI) applications, high-level techniques based on audio-driven talking face generation are often challenged by temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, temporal disturbances introduced during the inference phase continue to limit generative performance. Inspired by the intrinsic connection between a segmented static facial image and a stable appearance representation, this study proposes two strategies to improve performance at the inference stage: Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR). MTS segments the input video into several clips, effectively reducing the complexity of the inference process, while SFR uses static facial references to mitigate the temporal noise introduced by dynamic sequences, thereby improving the quality of the generated outputs. Extensive experiments on the LRS2 and VoxCeleb2 datasets demonstrate that the proposed strategies significantly improve inference performance on the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. To validate effectiveness in realistic application scenarios, the proposed solution has also been deployed on healthcare devices.
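The abstract describes both strategies as inference-time wrappers around an unmodified pretrained generator. The Python sketch below is a minimal, hypothetical rendering of that idea; it is not the authors' published implementation. The `generator` callable, the fixed clip length, the first-frame reference heuristic, and all helper names are assumptions made for illustration only.

```python
import numpy as np

def manual_temporal_segmentation(frames, audio, clip_len):
    """MTS (sketch): split the aligned video/audio streams into
    fixed-length clips so each inference pass handles a shorter,
    simpler sequence."""
    clips = []
    samples_per_frame = len(audio) // len(frames)  # assumes pre-aligned streams
    for start in range(0, len(frames), clip_len):
        end = min(start + clip_len, len(frames))
        clips.append((frames[start:end],
                      audio[start * samples_per_frame:end * samples_per_frame]))
    return clips

def static_facial_reference(clip_frames):
    """SFR (sketch): use one static frame per clip as the appearance
    reference instead of a dynamic frame sequence. The first frame is
    an arbitrary choice here; the paper may select differently."""
    return clip_frames[0]

def generate_talking_face(frames, audio, generator, clip_len=25):
    """Run inference clip by clip, conditioning each clip on its own
    static reference frame, then concatenate the generated outputs.
    `generator(clip_audio, reference)` is a placeholder for any
    pretrained audio-driven talking-face model, used unchanged."""
    outputs = []
    for clip_frames, clip_audio in manual_temporal_segmentation(frames, audio, clip_len):
        reference = static_facial_reference(clip_frames)
        outputs.append(generator(clip_audio, reference))
    return np.concatenate(outputs, axis=0)
```

The design point this sketch mirrors is the abstract's central claim: only the inference loop changes. The generator's architecture and training remain untouched, while shorter clips and a static per-clip reference reduce the temporal disturbances the paper attributes to dynamic input sequences.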
Source journal: IEEE Transactions on Consumer Electronics
CiteScore: 7.70
Self-citation rate: 9.30%
Articles per year: 59
Review time: 3.3 months
Journal description: The main focus of the IEEE Transactions on Consumer Electronics is the engineering and research aspects of the theory, design, construction, manufacture, or end use of mass-market electronics, systems, software, and services for consumers.