Multi-Level Pixel-Wise Correspondence Learning for 6DoF Face Pose Estimation

IF 8.4 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia | Pub Date: 2024-04-22 | DOI: 10.1109/TMM.2024.3391888

Miao Xu;Xiangyu Zhu;Yueying Kao;Zhiwen Chen;Jiangjing Lyu;Zhen Lei

Volume 26, pages 9423-9435. Full text: https://ieeexplore.ieee.org/document/10506678/

Citations: 0

Abstract

In this paper, we focus on estimating the six-degrees-of-freedom (6DoF) pose of a face from a single RGB image, an important but under-investigated problem in 3D face applications such as face reconstruction, forgery detection, and virtual try-on. This problem differs from traditional face pose estimation and 3D face reconstruction because the distance from the camera to the face must also be estimated, and this distance cannot be regressed directly due to the non-linearity of the pose space. To solve the problem, we follow Perspective-n-Point (PnP): we predict the correspondences between 3D points in canonical space and 2D facial pixels in the input image, and then solve for the 6DoF pose parameters. In this framework, the central problem of 6DoF estimation is building the correspondence matrix between a set of sampled 2D pixels and 3D points, and we propose a Correspondence Learning Transformer (CLT) to achieve this goal. Specifically, we build the 2D and 3D features from local, global, and semantic information, and employ self-attention to let the 2D and 3D features interact with each other and build the 2D-3D correspondence. In addition, we argue that 6DoF estimation is related not only to the face appearance itself but also to the external context of the face, which contains rich information about the distance to the camera. We therefore extract global-and-local features from the integration of face and context: the cropped face image, with its smaller receptive field, concentrates on the subtle distortion caused by perspective projection, while the whole image, with its larger receptive field, provides shoulder and environment information. Experiments show that our method achieves a 2.0% improvement in $MAE_{r}$ and $ADD$ on ARKitFace and a 4.0%/0.7% improvement in $MAE_{t}$ on ARKitFace/BIWI.
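The PnP step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a known pinhole intrinsic matrix `K`, a toy one-hot correspondence matrix `corr` standing in for the output of a learned model such as CLT, and a plain Direct Linear Transform (DLT) solver in place of whatever PnP solver the paper uses. The names `project`, `solve_pnp_dlt`, and `add_metric` are hypothetical, and `add_metric` follows the common definition of ADD (mean distance between model points under the estimated and ground-truth poses), which may differ in detail from the paper's.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Project canonical 3D points into the image with a pinhole camera."""
    cam = points_3d @ R.T + t            # (N, 3) points in camera coordinates
    uv = cam @ K.T                       # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]

def solve_pnp_dlt(points_3d, pixels, K):
    """Recover (R, t) from 2D-3D pairs via DLT (needs >= 6 non-coplanar points)."""
    n = len(points_3d)
    ones = np.ones((n, 1))
    # Normalised image coordinates remove the intrinsics from the problem.
    xy = (np.hstack([pixels, ones]) @ np.linalg.inv(K).T)[:, :2]
    X = np.hstack([points_3d, ones])     # homogeneous 3D points, (N, 4)
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = X
    A[0::2, 8:12] = -xy[:, 0:1] * X
    A[1::2, 4:8] = X
    A[1::2, 8:12] = -xy[:, 1:2] * X
    # The null vector of A is the stacked 3x4 matrix [R|t], up to scale and sign.
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)
    U, S, Vt = np.linalg.svd(P[:, :3])
    R, scale = U @ Vt, S.mean()
    if np.linalg.det(R) < 0:             # flip sign so R is a proper rotation
        R, scale = -R, -scale
    return R, P[:, 3] / scale

def add_metric(points_3d, R_est, t_est, R_gt, t_gt):
    """Mean distance between model points under estimated and true poses."""
    est = points_3d @ R_est.T + t_est
    gt = points_3d @ R_gt.T + t_gt
    return np.linalg.norm(est - gt, axis=1).mean()

# Toy data (all values are assumptions chosen for illustration).
rng = np.random.default_rng(0)
points_3d = rng.uniform(-0.1, 0.1, size=(50, 3))    # canonical points, metres
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
angle = np.deg2rad(20.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.02, -0.01, 0.5])               # face about 0.5 m away

# Sampled pixels arrive in arbitrary order; corr[i, j] = 1 pairs pixel i with point j.
perm = rng.permutation(len(points_3d))
pixels = project(points_3d[perm], R_true, t_true, K)
corr = np.eye(len(points_3d))[perm]

matched = points_3d[corr.argmax(axis=1)]            # 3D point assigned to each pixel
R_est, t_est = solve_pnp_dlt(matched, pixels, K)
add_error = add_metric(points_3d, R_est, t_est, R_true, t_true)
print("estimated translation:", t_est)
print("ADD on noise-free data:", add_error)
```

With noise-free correspondences the DLT recovers the pose, including the camera-to-face distance in the third component of the translation, up to numerical precision. With noisy or soft correspondences, a robust or iterative PnP solver would be the more usual choice.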


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.