Multimodal Emotional Talking Face Generation Based on Action Units

IF 8.3 · Tier 1 (Engineering & Technology) · JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC
Jiayi Lyu;Xing Lan;Guohong Hu;Hanyu Jiang;Wei Gan;Jinbao Wang;Jian Xue
{"title":"基于动作单元的多模态情感说话脸生成","authors":"Jiayi Lyu;Xing Lan;Guohong Hu;Hanyu Jiang;Wei Gan;Jinbao Wang;Jian Xue","doi":"10.1109/TCSVT.2024.3523359","DOIUrl":null,"url":null,"abstract":"Talking face generation focuses on creating natural facial animations that align with the provided text or audio input. Current methods in this field primarily rely on facial landmarks to convey emotional changes. However, spatial key-points are valuable, yet limited in capturing the intricate dynamics and subtle nuances of emotional expressions due to their restricted spatial coverage. Consequently, this reliance on sparse landmarks can result in decreased accuracy and visual quality, especially when representing complex emotional states. To address this issue, we propose a novel method called Emotional Talking with Action Unit (ETAU), which seamlessly integrates facial Action Units (AUs) into the generation process. Unlike previous works that solely rely on facial landmarks, ETAU employs both Action Units and landmarks to comprehensively represent facial expressions through interpretable representations. Our method provides a detailed and dynamic representation of emotions by capturing the complex interactions among facial muscle movements. Moreover, ETAU adopts a multi-modal strategy by seamlessly integrating emotion prompts, driving videos, and target images, and by leveraging various input data effectively, it generates highly realistic and emotional talking-face videos. Through extensive evaluations across multiple datasets, including MEAD, LRW, GRID and HDTF, ETAU outperforms previous methods, showcasing its superior ability to generate high-quality, expressive talking faces with improved visual fidelity and synchronization. Moreover, ETAU exhibits a significant improvement on the emotion accuracy of the generated results, reaching an impressive average accuracy of 84% on the MEAD dataset.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4026-4038"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal Emotional Talking Face Generation Based on Action Units\",\"authors\":\"Jiayi Lyu;Xing Lan;Guohong Hu;Hanyu Jiang;Wei Gan;Jinbao Wang;Jian Xue\",\"doi\":\"10.1109/TCSVT.2024.3523359\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Talking face generation focuses on creating natural facial animations that align with the provided text or audio input. Current methods in this field primarily rely on facial landmarks to convey emotional changes. However, spatial key-points are valuable, yet limited in capturing the intricate dynamics and subtle nuances of emotional expressions due to their restricted spatial coverage. Consequently, this reliance on sparse landmarks can result in decreased accuracy and visual quality, especially when representing complex emotional states. To address this issue, we propose a novel method called Emotional Talking with Action Unit (ETAU), which seamlessly integrates facial Action Units (AUs) into the generation process. Unlike previous works that solely rely on facial landmarks, ETAU employs both Action Units and landmarks to comprehensively represent facial expressions through interpretable representations. Our method provides a detailed and dynamic representation of emotions by capturing the complex interactions among facial muscle movements. 
Moreover, ETAU adopts a multi-modal strategy by seamlessly integrating emotion prompts, driving videos, and target images, and by leveraging various input data effectively, it generates highly realistic and emotional talking-face videos. Through extensive evaluations across multiple datasets, including MEAD, LRW, GRID and HDTF, ETAU outperforms previous methods, showcasing its superior ability to generate high-quality, expressive talking faces with improved visual fidelity and synchronization. Moreover, ETAU exhibits a significant improvement on the emotion accuracy of the generated results, reaching an impressive average accuracy of 84% on the MEAD dataset.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4026-4038\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2024-12-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10816597/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10816597/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Talking face generation focuses on creating natural facial animations that align with the provided text or audio input. Current methods in this field primarily rely on facial landmarks to convey emotional changes. However, spatial key-points are valuable, yet limited in capturing the intricate dynamics and subtle nuances of emotional expressions due to their restricted spatial coverage. Consequently, this reliance on sparse landmarks can result in decreased accuracy and visual quality, especially when representing complex emotional states. To address this issue, we propose a novel method called Emotional Talking with Action Unit (ETAU), which seamlessly integrates facial Action Units (AUs) into the generation process. Unlike previous works that solely rely on facial landmarks, ETAU employs both Action Units and landmarks to comprehensively represent facial expressions through interpretable representations. Our method provides a detailed and dynamic representation of emotions by capturing the complex interactions among facial muscle movements. Moreover, ETAU adopts a multi-modal strategy by seamlessly integrating emotion prompts, driving videos, and target images, and by leveraging various input data effectively, it generates highly realistic and emotional talking-face videos. Through extensive evaluations across multiple datasets, including MEAD, LRW, GRID and HDTF, ETAU outperforms previous methods, showcasing its superior ability to generate high-quality, expressive talking faces with improved visual fidelity and synchronization. Moreover, ETAU exhibits a significant improvement on the emotion accuracy of the generated results, reaching an impressive average accuracy of 84% on the MEAD dataset.
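To make the representation described above more concrete, the following is a minimal sketch of how per-frame Action Unit intensities and facial landmarks could be packed into a single conditioning vector for a talking-face generator. This is not the authors' released code: the AU list, the 68-point landmark convention, the 0-5 intensity scale, and the concatenation scheme are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: combine AU intensities and landmarks into one condition vector.
import numpy as np

# 17 commonly used AUs (FACS numbering); the paper does not specify its exact set.
AU_IDS = [1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26, 45]
NUM_LANDMARKS = 68  # standard 68-point facial landmark convention (assumed)

def build_condition(au_intensities: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Concatenate normalized AU intensities with flattened, centered landmarks."""
    assert au_intensities.shape == (len(AU_IDS),)
    assert landmarks.shape == (NUM_LANDMARKS, 2)
    aus = np.clip(au_intensities / 5.0, 0.0, 1.0)             # AU intensities assumed on a 0-5 scale
    pts = landmarks - landmarks.mean(axis=0, keepdims=True)    # remove global translation
    pts = pts / (np.abs(pts).max() + 1e-8)                     # scale roughly to [-1, 1]
    return np.concatenate([aus, pts.reshape(-1)])              # shape: (17 + 136,) = (153,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_aus = rng.uniform(0.0, 5.0, size=len(AU_IDS))             # placeholder detector output
    fake_landmarks = rng.uniform(0.0, 256.0, size=(NUM_LANDMARKS, 2))
    cond = build_condition(fake_aus, fake_landmarks)
    print(cond.shape)  # (153,)
```

In practice such a vector would be produced per frame from an AU/landmark detector and fed, together with the emotion prompt and the target identity image, to the generator; the exact fusion mechanism in ETAU is described in the full paper.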
Source Journal
CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months
About the journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.