Audio-to-Facial Landmarks Generator for Talking Face Video Synthesis
Dasol Jeong, Injae Lee, J. Paik
2023 International Conference on Electronics, Information, and Communication (ICEIC), February 5, 2023
DOI: 10.1109/ICEIC57457.2023.10049847
Audio-driven talking face methods have been studied to achieve accurate lip synchronization. However, generating head-pose movement and personalized facial features remains a challenging problem. Solving it requires identifying context from the audio, generating the head pose and lip motion, and synthesizing a personalized face. We introduce a facial landmark generation method that produces audio-based head pose and lip motion using an audio transformer. The audio transformer extracts audio features containing contextual information and generates generalized head-pose and lip-motion landmarks. To synthesize personalized features on the generated landmarks, a talking face video is produced by a model adapted through meta-learning. With only a few images, even unseen faces can be made to speak the desired audio. In addition, the proposed method is applicable to various languages and enables photo-realistic synthesis and fast inference.
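The core stage described above, an audio transformer that maps contextual audio features to per-frame facial landmarks, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the module names, feature dimensions, and the 68-point 2-D landmark layout are all assumptions.

```python
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Hypothetical sketch: maps a sequence of audio features to per-frame
    facial landmarks (head pose + lip motion), using a transformer encoder
    to capture contextual information across audio frames."""

    def __init__(self, audio_dim=80, d_model=256, n_landmarks=68):
        super().__init__()
        # Project per-frame audio features (e.g. mel spectrogram bins)
        # into the transformer's model dimension.
        self.embed = nn.Linear(audio_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Regress 2-D coordinates for every landmark at each frame.
        self.head = nn.Linear(d_model, n_landmarks * 2)

    def forward(self, audio_feats):          # (batch, frames, audio_dim)
        x = self.embed(audio_feats)
        x = self.encoder(x)                  # contextual audio features
        out = self.head(x)                   # (batch, frames, n_landmarks*2)
        return out.view(out.shape[0], out.shape[1], -1, 2)

# Example: a clip of 100 audio frames with 80-dim features per frame.
model = AudioToLandmarks()
landmarks = model(torch.randn(1, 100, 80))
print(landmarks.shape)  # (1, 100, 68, 2): one 2-D landmark set per frame
```

In the paper's pipeline, the generalized landmarks produced at this stage would then be rendered into a personalized talking face video by a generator adapted via meta-learning from a few reference images of the target identity; that rendering stage is not shown here.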