{"title":"Generating Talking Facial Videos Driven by Speech Using 3D Model and Motion Model","authors":"Fei Pan, Dejun Wang, Zonghua Hu, LongYang Yu","doi":"10.1109/ISCTIS58954.2023.10213134","DOIUrl":null,"url":null,"abstract":"Facial expression is one of the most important features of a face. In previous works on generating talking facial videos driven by speech without additional driving information, existing models struggled to directly learn the mapping from speech to facial features, resulting in poor quality of generated facial expressions. In this paper, we propose a method for generating speech-driven facial videos using 3D models and motion capture. This method demonstrates good performance in terms of model robustness, adaptation to large head poses, and improvement of fine-grained facial expression details. We learn features from speech, reconstruct the face by fitting 3DMM coefficients using speech, and employ a motion-captured based generative adversarial network to ensure clear facial texture details in the generated faces. On the publicly available dataset VoxCeleb2, our method achieves scores of 31.22 in PSNR, 0.89 in SSIM, 19.4 in FID, and 1.96 in F_LMD, outperforming other methods. On the MEAD dataset, our method achieves scores of 30.65 in PSNR, 0.68 in SSIM, 20.5 in FID, 2.36 in SyncNet, and 2.45 in F_LMD, outperforming other methods. Experimental results demonstrate that our method effectively enhances the model robustness for speech-driven facial video generation without additional driving information.","PeriodicalId":334790,"journal":{"name":"2023 3rd International Symposium on Computer Technology and Information Science (ISCTIS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 3rd International Symposium on Computer Technology and Information Science (ISCTIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCTIS58954.2023.10213134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Facial expression is one of the most important features of a face. In previous work on generating talking facial videos driven by speech without additional driving information, existing models struggled to directly learn the mapping from speech to facial features, resulting in poor quality of the generated facial expressions. In this paper, we propose a method for generating speech-driven facial videos using 3D models and motion capture. The method performs well in terms of model robustness, adaptation to large head poses, and fine-grained facial expression detail. We learn features from speech, reconstruct the face by fitting 3DMM coefficients predicted from the speech, and employ a motion-capture-based generative adversarial network to keep the facial texture details of the generated faces sharp. On the publicly available VoxCeleb2 dataset, our method achieves 31.22 PSNR, 0.89 SSIM, 19.4 FID, and 1.96 F_LMD, outperforming other methods. On the MEAD dataset, it achieves 30.65 PSNR, 0.68 SSIM, 20.5 FID, 2.36 SyncNet, and 2.45 F_LMD, again outperforming other methods. Experimental results demonstrate that our method effectively improves robustness for speech-driven facial video generation without additional driving information.
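To make the described pipeline concrete, below is a minimal sketch of its first stage: regressing per-frame 3DMM coefficients from speech features. This is not the authors' implementation; the GRU encoder, the separate expression/pose heads, and all dimensions (80-dimensional audio features, 64 expression coefficients, 6 pose parameters) are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's): map a window of
# audio features (e.g. MFCCs) to per-frame 3DMM expression and pose
# coefficients, which would then drive a face mesh before GAN rendering.
import torch
import torch.nn as nn


class Speech2Coeff(nn.Module):
    """Hypothetical speech-to-3DMM-coefficient regressor."""

    def __init__(self, audio_dim=80, hidden=256, n_exp=64, n_pose=6):
        super().__init__()
        # Temporal audio encoder (assumption: a 2-layer GRU).
        self.encoder = nn.GRU(audio_dim, hidden, num_layers=2, batch_first=True)
        # Separate heads for expression and head-pose coefficients.
        self.exp_head = nn.Linear(hidden, n_exp)
        self.pose_head = nn.Linear(hidden, n_pose)

    def forward(self, audio_feats):            # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)       # (B, T, hidden)
        return self.exp_head(h), self.pose_head(h)  # per-frame coefficients


if __name__ == "__main__":
    model = Speech2Coeff()
    mfcc = torch.randn(1, 25, 80)              # e.g. one second of audio at 25 fps
    exp, pose = model(mfcc)
    print(exp.shape, pose.shape)               # (1, 25, 64) and (1, 25, 6)
```

In the full method described by the abstract, the predicted coefficients would reconstruct the 3DMM face, and a motion-capture-based GAN generator (not sketched here) would render the final frames with sharp texture detail.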