{"title":"Instant Multi-View Head Capture through Learnable Registration","authors":"Timo Bolkart, Tianye Li, Michael J. Black","doi":"10.1109/CVPR52729.2023.00081","DOIUrl":null,"url":null,"abstract":"Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow and commonly address the problem in two separate steps; multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scans' surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training, we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view-and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR52729.2023.00081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow and commonly address the problem in two separate steps: multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scans' surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training, we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.
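To make the training signal concrete: the abstract describes supervising TEMPEH with a geometric loss between the predicted mesh and the raw MVS scan. The sketch below is a minimal, hypothetical simplification in PyTorch; the paper minimizes a point-to-surface distance, which we approximate here with a chamfer-style scan-to-nearest-vertex distance. All sizes and names are illustrative, not taken from the released code.

```python
# Minimal sketch of a scan-to-mesh geometric loss for registration supervision.
# Assumption: point-to-nearest-vertex distance stands in for the paper's
# point-to-surface distance; a real implementation would use a spatial
# acceleration structure and true point-to-triangle distances.
import torch

def scan_to_mesh_loss(pred_vertices: torch.Tensor,
                      scan_points: torch.Tensor) -> torch.Tensor:
    """pred_vertices: (V, 3) predicted mesh vertices in dense correspondence.
    scan_points:      (S, 3) unordered, noisy points from the raw MVS scan.
    Returns the mean distance from each scan point to its closest predicted
    vertex, so the prediction is pulled to cover the scanned surface.
    """
    dists = torch.cdist(scan_points, pred_vertices)  # (S, V) pairwise distances
    return dists.min(dim=1).values.mean()

# Usage with random stand-in geometry (sizes are illustrative):
pred = torch.rand(5023, 3, requires_grad=True)  # e.g. a FLAME-topology head
scan = torch.rand(2048, 3)                      # raw scan points
loss = scan_to_mesh_loss(pred, scan)
loss.backward()                                 # gradients flow back to the mesh
```

Because the loss is defined against raw scans rather than pre-registered meshes, the network itself acts as the regularizer: it can only explain the scans through its shared, corresponded output topology.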
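The abstract's volumetric feature representation "samples and fuses features from each view using camera calibration information." A common way to realize this is to project each 3D grid point into every calibrated view and bilinearly sample that view's 2D feature map. The sketch below shows this projection-and-sampling step for a single view; function name, shapes, and conventions are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: sample one view's image features into a 3D feature volume
# using pinhole camera calibration (intrinsics K, extrinsics Rt).
import torch
import torch.nn.functional as F

def sample_view_features(feat: torch.Tensor,  # (C, H, W) one view's feature map
                         K: torch.Tensor,     # (3, 3) camera intrinsics
                         Rt: torch.Tensor,    # (3, 4) world-to-camera extrinsics
                         grid: torch.Tensor,  # (D, D, D, 3) world-space points
                         ) -> torch.Tensor:   # (C, D, D, D) per-view volume
    D = grid.shape[0]
    pts = grid.reshape(-1, 3)                          # (N, 3), N = D^3
    cam = (Rt[:, :3] @ pts.T + Rt[:, 3:]).T            # world -> camera frame
    pix = (K @ cam.T).T                                # camera -> pixel coords
    uv = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)       # perspective divide
    H, W = feat.shape[1:]
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    uv_norm = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                           2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    sampled = F.grid_sample(feat[None], uv_norm[None, None],  # (1, C, 1, N)
                            align_corners=True)
    return sampled.reshape(feat.shape[0], D, D, D)
```

The per-view volumes from all cameras would then be fused, per the abstract with view- and surface-aware weighting rather than a plain mean, before a 3D network regresses the corresponded head vertices from the fused volume.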