{"title":"Multimodal Emotional Talking Face Generation Based on Action Units","authors":"Jiayi Lyu;Xing Lan;Guohong Hu;Hanyu Jiang;Wei Gan;Jinbao Wang;Jian Xue","doi":"10.1109/TCSVT.2024.3523359","DOIUrl":null,"url":null,"abstract":"Talking face generation focuses on creating natural facial animations that align with the provided text or audio input. Current methods in this field primarily rely on facial landmarks to convey emotional changes. However, spatial key-points are valuable, yet limited in capturing the intricate dynamics and subtle nuances of emotional expressions due to their restricted spatial coverage. Consequently, this reliance on sparse landmarks can result in decreased accuracy and visual quality, especially when representing complex emotional states. To address this issue, we propose a novel method called Emotional Talking with Action Unit (ETAU), which seamlessly integrates facial Action Units (AUs) into the generation process. Unlike previous works that solely rely on facial landmarks, ETAU employs both Action Units and landmarks to comprehensively represent facial expressions through interpretable representations. Our method provides a detailed and dynamic representation of emotions by capturing the complex interactions among facial muscle movements. Moreover, ETAU adopts a multi-modal strategy by seamlessly integrating emotion prompts, driving videos, and target images, and by leveraging various input data effectively, it generates highly realistic and emotional talking-face videos. Through extensive evaluations across multiple datasets, including MEAD, LRW, GRID and HDTF, ETAU outperforms previous methods, showcasing its superior ability to generate high-quality, expressive talking faces with improved visual fidelity and synchronization. Moreover, ETAU exhibits a significant improvement on the emotion accuracy of the generated results, reaching an impressive average accuracy of 84% on the MEAD dataset.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4026-4038"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10816597/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
Talking face generation focuses on creating natural facial animations that align with the provided text or audio input. Current methods in this field primarily rely on facial landmarks to convey emotional changes. While spatial keypoints are valuable, their restricted spatial coverage limits their ability to capture the intricate dynamics and subtle nuances of emotional expressions. Consequently, relying on sparse landmarks can degrade accuracy and visual quality, especially when representing complex emotional states. To address this issue, we propose a novel method called Emotional Talking with Action Unit (ETAU), which seamlessly integrates facial Action Units (AUs) into the generation process. Unlike previous works that rely solely on facial landmarks, ETAU employs both Action Units and landmarks to comprehensively represent facial expressions through interpretable representations. Our method provides a detailed and dynamic representation of emotions by capturing the complex interactions among facial muscle movements. Moreover, ETAU adopts a multi-modal strategy that integrates emotion prompts, driving videos, and target images; by leveraging these varied inputs effectively, it generates highly realistic and emotionally expressive talking-face videos. In extensive evaluations across multiple datasets, including MEAD, LRW, GRID, and HDTF, ETAU outperforms previous methods, demonstrating its superior ability to generate high-quality, expressive talking faces with improved visual fidelity and synchronization. ETAU also markedly improves the emotion accuracy of the generated results, reaching an average accuracy of 84% on the MEAD dataset.
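The abstract does not include implementation details, but the core idea, conditioning generation on an interpretable expression representation built from both AU intensities and facial landmarks rather than landmarks alone, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the class name, the choice of 17 AUs and 68 landmarks (common in AU detectors and landmark trackers), and the MLP fusion architecture are hypothetical, not ETAU's actual design.

```python
import torch
import torch.nn as nn


class AUExpressionEncoder(nn.Module):
    """Hypothetical encoder fusing AU intensities with facial landmarks.

    A loose sketch of the general idea: AU intensities (e.g., from an
    off-the-shelf AU detector) capture muscle-level dynamics, while 2D
    landmarks capture coarse facial geometry. Both are fused into a
    single expression code that could condition a face generator.
    """

    def __init__(self, num_aus: int = 17, num_landmarks: int = 68,
                 embed_dim: int = 256):
        super().__init__()
        # Input: AU intensity vector concatenated with flattened 2D landmarks.
        in_dim = num_aus + num_landmarks * 2
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, embed_dim),
        )

    def forward(self, au_intensities: torch.Tensor,
                landmarks: torch.Tensor) -> torch.Tensor:
        # au_intensities: (B, num_aus); landmarks: (B, num_landmarks, 2)
        fused = torch.cat([au_intensities, landmarks.flatten(1)], dim=1)
        return self.mlp(fused)  # (B, embed_dim) expression code


# Usage: one frame's cues mapped to a conditioning vector.
encoder = AUExpressionEncoder()
aus = torch.rand(1, 17)        # AU activations, e.g., normalized to [0, 1]
lms = torch.rand(1, 68, 2)     # normalized landmark coordinates
expression_code = encoder(aus, lms)
print(expression_code.shape)   # torch.Size([1, 256])
```

The appeal of such a representation, as the abstract argues, is interpretability: each input dimension corresponds to a named facial muscle movement (e.g., AU12, lip corner puller) or a spatial keypoint, rather than an opaque latent feature.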
Journal Description:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.