Audio-Driven Talking Face Generation With Segmented Static Facial References for Customized Health Device Interactions

IF 10.9 · CAS Tier 2 (Computer Science) · JCR Q1 (Engineering, Electrical & Electronic)
Zige Wang;Yashuai Wang;Tianyu Liu;Peng Zhang;Lei Xie;Yangming Guo
{"title":"Audio-Driven Talking Face Generation With Segmented Static Facial References for Customized Health Device Interactions","authors":"Zige Wang;Yashuai Wang;Tianyu Liu;Peng Zhang;Lei Xie;Yangming Guo","doi":"10.1109/TCE.2025.3565518","DOIUrl":null,"url":null,"abstract":"In a variety of human-machine interaction (HMI) applications, the high-level techniques based on audio-driven talking face generation are often challenged by the issues of temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, the temporal disturbances introduced during the inference phase continue to limit the enhancement of generative performance. Inspired by the intrinsic connection between the segmented static facial image and the stable appearance representation, in this study, two strategies, Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR), are proposed to improve performance during the inference stage. The corresponding functionality consists of: MTS involves segmenting the input video into several clips, effectively reducing the complexity of the inference process, and SFR utilizes static facial references to mitigate the temporal noise generated by dynamic sequences, thereby enhancing the quality of the generated outputs. Substantial experiments on the LRS2 and VoxCeleb2 datasets have demonstrated that the proposed strategies are able to significantly enhance inference performance with the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. For effectiveness validation in realistic scenario applications, a deployment has also been conducted on the healthcare devices with the proposed solution.","PeriodicalId":13208,"journal":{"name":"IEEE Transactions on Consumer Electronics","volume":"71 2","pages":"5404-5413"},"PeriodicalIF":10.9000,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Consumer Electronics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10980001/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

In a variety of human-machine interaction (HMI) applications, high-level techniques based on audio-driven talking face generation are often challenged by temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, temporal disturbances introduced during the inference phase continue to limit generative performance. Inspired by the intrinsic connection between a segmented static facial image and a stable appearance representation, this study proposes two strategies to improve performance at the inference stage: Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR). MTS segments the input video into several clips, effectively reducing the complexity of the inference process, while SFR uses static facial references to mitigate the temporal noise introduced by dynamic sequences, thereby improving the quality of the generated outputs. Extensive experiments on the LRS2 and VoxCeleb2 datasets demonstrate that the proposed strategies significantly improve inference performance on the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. To validate effectiveness in realistic application scenarios, the proposed solution has also been deployed on healthcare devices.
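The abstract describes both strategies as inference-time wrappers around an unmodified pretrained generator. The Python sketch below is a minimal, hypothetical rendering of that idea; it is not the authors' published implementation. The `generator` callable, the fixed clip length, the first-frame reference heuristic, and all helper names are assumptions made for illustration only.

```python
import numpy as np

def manual_temporal_segmentation(frames, audio, clip_len):
    """MTS (sketch): split the aligned video/audio streams into
    fixed-length clips so each inference pass handles a shorter,
    simpler sequence."""
    clips = []
    samples_per_frame = len(audio) // len(frames)  # assumes pre-aligned streams
    for start in range(0, len(frames), clip_len):
        end = min(start + clip_len, len(frames))
        clips.append((frames[start:end],
                      audio[start * samples_per_frame:end * samples_per_frame]))
    return clips

def static_facial_reference(clip_frames):
    """SFR (sketch): use one static frame per clip as the appearance
    reference instead of a dynamic frame sequence. The first frame is
    an arbitrary choice here; the paper may select differently."""
    return clip_frames[0]

def generate_talking_face(frames, audio, generator, clip_len=25):
    """Run inference clip by clip, conditioning each clip on its own
    static reference frame, then concatenate the generated outputs.
    `generator(clip_audio, reference)` is a placeholder for any
    pretrained audio-driven talking-face model, used unchanged."""
    outputs = []
    for clip_frames, clip_audio in manual_temporal_segmentation(frames, audio, clip_len):
        reference = static_facial_reference(clip_frames)
        outputs.append(generator(clip_audio, reference))
    return np.concatenate(outputs, axis=0)
```

The design point this sketch mirrors is the abstract's central claim: only the inference loop changes. The generator's architecture and training remain untouched, while shorter clips and a static per-clip reference reduce the temporal disturbances the paper attributes to dynamic input sequences.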
Source journal: IEEE Transactions on Consumer Electronics
CiteScore: 7.70
Self-citation rate: 9.30%
Articles per year: 59
Review time: 3.3 months
Journal description: The main focus of the IEEE Transactions on Consumer Electronics is the engineering and research aspects of the theory, design, construction, manufacture, or end use of mass-market electronics, systems, software, and services for consumers.