SPARK: Self-supervised Personalized Real-time Monocular Face Capture

arXiv - CS - Computer Vision and Pattern Recognition, Pub Date: 2024-09-12, DOI: arxiv-2409.07984

Kelian Baert, Shrisha Bharadwaj, Fabien Castan, Benoit Maujean, Marc Christie, Victoria Abrevaya, Adnane Boukhayma



Abstract

Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches can regress parametric 3D face models in real time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods, however, suffer from a clear limitation: the underlying parametric face model provides only a coarse estimate of the face shape, which limits their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this paper, we propose a method for high-precision 3D face capture that uses a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start by reconstructing a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then take the encoder from a pre-trained monocular face reconstruction method, substitute its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. The result is a trained encoder that efficiently regresses pose and expression parameters in real time from previously unseen images and that, combined with our personalized geometry model, yields more accurate, higher-fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we show that our final model outperforms state-of-the-art baselines, and we demonstrate its ability to generalize to unseen pose, expression and lighting.
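The second stage described in the abstract (keep a pre-trained encoder, swap its decoder for a frozen personalized model, and fine-tune with a self-supervised reconstruction objective) can be illustrated with a deliberately small sketch. Everything below is an assumption made for illustration and not the authors' implementation: the personalized model is a toy linear expression basis, the "renderer" is the identity (frames are just flattened vertex coordinates), the encoder is a single linear map, and all names and sizes (`personalized_decoder`, `encoder_w`, `N_COORDS`, ...) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_COORDS, N_EXPR, N_FRAMES = 150, 4, 64  # toy sizes (hypothetical)

# Stage 1 is assumed to be done already: it yields a personalized model of
# the subject. Here that model is a toy linear one: neutral geometry plus an
# orthonormal expression basis. It stays frozen during stage 2.
template = rng.normal(size=N_COORDS)
basis, _ = np.linalg.qr(rng.normal(size=(N_COORDS, N_EXPR)))

def personalized_decoder(expr):
    """Frozen personalized model: expression codes -> flattened geometry."""
    return template + expr @ basis.T

# Toy "video frames" of the subject. A real pipeline would render images;
# this sketch uses an identity renderer, so a frame is the geometry itself.
true_expr = rng.normal(size=(N_FRAMES, N_EXPR))
frames = personalized_decoder(true_expr)

# Stand-in for a pre-trained encoder: roughly right, but not aligned with
# the personalized model (the noise term plays the role of that mismatch).
encoder_w = basis.T + 0.1 * rng.normal(size=(N_EXPR, N_COORDS))

def encode(w, observed):
    """Linear toy encoder: frames -> expression codes."""
    return (observed - template) @ w.T

def reconstruction_loss(w):
    """Self-supervised objective: re-synthesize each frame and compare."""
    residual = personalized_decoder(encode(w, frames)) - frames
    return float((residual ** 2).sum(axis=1).mean())

initial_loss = reconstruction_loss(encoder_w)

# Transfer learning: gradient descent on the encoder only, decoder frozen.
learning_rate = 0.1
features = frames - template
for _ in range(200):
    residual = personalized_decoder(features @ encoder_w.T) - frames
    grad = (2.0 / N_FRAMES) * (residual @ basis).T @ features
    encoder_w -= learning_rate * grad

final_loss = reconstruction_loss(encoder_w)
print(f"reconstruction loss: {initial_loss:.4f} -> {final_loss:.2e}")
```

No ground-truth expression labels are used during fine-tuning: the only training signal is how well the frozen personalized decoder reproduces each input frame from the encoder's output, which is the sense in which the objective is self-supervised. In this toy setting the loss can be driven close to zero because the frames were generated by the decoder itself; real images would not allow that.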


