Decaf: Monocular Deformation Capture for Face and Hand Interactions

ACM Transactions on Graphics (TOG), Pub Date: 2023-09-28, DOI: 10.1145/3618329

Soshi Shimada, Vladislav Golyanik, Patrick Pérez, C. Theobalt


Citations: 0

Abstract

Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects (e.g., two hands or humans interacting with rigid environments). Modelling dense non-rigid object deformations in this setting (e.g., when hands are interacting with a face) has remained largely unaddressed so far, although such effects can improve the realism of downstream applications such as AR/VR, 3D virtual avatar communications, and character animations. This is due to the severe ill-posedness of the monocular view setting and the associated challenges (e.g., acquiring a dataset for training and evaluation, or obtaining a reasonable non-uniform stiffness for the deformable object). While it is possible to naïvely track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates, such as depth ambiguity, unnatural intra-object collisions, and missing or implausible deformations. Hence, this paper introduces the first method that addresses the fundamental challenges described above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions, and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible compared to several baselines applicable in our setting, both quantitatively and qualitatively. https://vcai.mpi-inf.mpg.de/projects/Decaf
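
The abstract names position-based dynamics with non-uniform stiffness but gives no implementation details. As a rough illustration of that general idea only, and not the authors' pipeline, the sketch below projects distance constraints on a toy chain of particles, where each constraint carries its own stiffness in [0, 1]. The function name `project_distance_constraints`, its parameters, and the toy data are all assumptions made for this example.

```python
import numpy as np


def project_distance_constraints(positions, edges, rest_lengths, stiffness,
                                 inv_mass, iterations=10):
    """Gauss-Seidel projection of distance constraints (illustrative sketch).

    positions:    (N, 3) particle positions
    edges:        (M, 2) integer index pairs, one per constraint
    rest_lengths: (M,) target length of each edge
    stiffness:    (M,) per-constraint stiffness in [0, 1] (non-uniform)
    inv_mass:     (N,) inverse masses; 0 pins a particle in place
    """
    p = np.array(positions, dtype=float)  # work on a copy
    for _ in range(iterations):
        for (i, j), rest, k in zip(edges, rest_lengths, stiffness):
            delta = p[i] - p[j]
            dist = np.linalg.norm(delta)
            w = inv_mass[i] + inv_mass[j]
            if dist < 1e-12 or w == 0.0:
                continue  # degenerate edge or both ends pinned
            # Move both ends along the edge towards the rest length,
            # scaled by this constraint's own stiffness.
            corr = k * (dist - rest) / w * (delta / dist)
            p[i] -= inv_mass[i] * corr
            p[j] += inv_mass[j] * corr
    return p


# Hypothetical toy data: a 3-particle chain whose second edge is stretched.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
edges = np.array([[0, 1], [1, 2]])
rest = np.array([1.0, 1.0])
inv_mass = np.array([0.0, 1.0, 1.0])  # particle 0 is pinned

soft = project_distance_constraints(pos, edges, rest,
                                    np.array([1.0, 0.1]), inv_mass, iterations=1)
stiff = project_distance_constraints(pos, edges, rest,
                                     np.array([1.0, 1.0]), inv_mass, iterations=1)
print(np.linalg.norm(soft[2] - soft[1]), np.linalg.norm(stiff[2] - stiff[1]))
```

In this toy setup a single pass restores the stiff edge to its rest length, while the soft edge stays noticeably stretched. That is the kind of spatially varying response a non-uniform stiffness assignment is meant to produce.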
