StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video

ACM SIGGRAPH 2023 Conference Proceedings. Pub Date: 2023-05-01. DOI: 10.1145/3588432.3591517

Lizhen Wang, Xiaochen Zhao, Jingxiang Sun, Yuxiang Zhang, Hongwen Zhang, Tao Yu, Yebin Liu


Citations: 10

Abstract

Face reenactment methods attempt to restore and re-animate portrait videos as realistically as possible. Existing methods face a trade-off between quality and controllability: 2D GAN-based methods achieve higher image quality but offer weaker fine-grained control of facial attributes than their 3D counterparts. In this work, we propose StyleAvatar, a real-time photo-realistic portrait avatar reconstruction method using StyleGAN-based networks, which can generate high-fidelity portrait avatars with faithful expression control. We expand the capabilities of StyleGAN by introducing a compositional representation and a sliding window augmentation method, which enable faster convergence and improve translation generalization. Specifically, we divide the portrait scene into three parts for adaptive adjustment: the facial region, the non-facial foreground region, and the background. In addition, our network combines UNet, StyleGAN, and time coding for video learning, which enables high-quality video generation. The sliding window augmentation method and a pre-training strategy are proposed to improve translation generalization and training performance, respectively. The proposed network can converge within two hours while ensuring high image quality and a forward rendering time of only 20 milliseconds. We also propose a real-time live system, which pushes the research further toward applications. Results and experiments demonstrate the superiority of our method in image quality, full portrait video generation, and real-time re-animation compared to existing facial reenactment methods. Training and inference code for this paper is available at https://github.com/LizhenWangT/StyleAvatar.
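The abstract names two ideas without giving their details: a three-part compositional representation (facial region, non-facial foreground, background) and a sliding window augmentation for translation generalization. The sketch below is only an illustration of what such steps could look like, not the authors' implementation. It assumes that the three parts are blended with per-pixel soft masks and that the augmentation amounts to cropping a fixed-size window at a random offset. The function names `composite` and `sliding_window_crop`, the mask convention, and the list-of-rows image format are all hypothetical.

```python
import random


def sliding_window_crop(frame, window, rng):
    """Crop a window x window patch at a random offset (assumed augmentation).

    `frame` is a 2-D image stored as a list of rows. Returns the patch and
    the (top, left) offset that was sampled.
    """
    height, width = len(frame), len(frame[0])
    if window > height or window > width:
        raise ValueError("window must fit inside the frame")
    top = rng.randint(0, height - window)   # randint is inclusive on both ends
    left = rng.randint(0, width - window)
    patch = [row[left:left + window] for row in frame[top:top + window]]
    return patch, (top, left)


def composite(face, foreground, background, face_mask, fg_mask):
    """Blend three layers per pixel (assumed composition rule).

    Masks hold weights in [0, 1]; the background receives whatever weight
    the face and foreground masks leave over.
    """
    out = []
    for y in range(len(face)):
        row = []
        for x in range(len(face[0])):
            w_face = face_mask[y][x]
            w_fg = fg_mask[y][x]
            w_bg = 1.0 - w_face - w_fg
            row.append(w_face * face[y][x]
                       + w_fg * foreground[y][x]
                       + w_bg * background[y][x])
        out.append(row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs sample the same offsets
    frame = [[r * 8 + c for c in range(8)] for r in range(8)]
    patch, offset = sliding_window_crop(frame, 4, rng)
    print("offset:", offset, "patch size:", len(patch), "x", len(patch[0]))
```

Sampling a new offset for each training sample would expose the network to the same portrait at many image positions, which is one plausible reading of "translation generalization"; the repository linked in the abstract is the authoritative source for how the method actually works.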

