Multi-Level Pixel-Wise Correspondence Learning for 6DoF Face Pose Estimation
Miao Xu; Xiangyu Zhu; Yueying Kao; Zhiwen Chen; Jiangjing Lyu; Zhen Lei
IEEE Transactions on Multimedia, vol. 26, pp. 9423-9435. Published 2024-04-22.
DOI: 10.1109/TMM.2024.3391888
URL: https://ieeexplore.ieee.org/document/10506678/
Impact factor: 8.4 (JCR Q1, Computer Science, Information Systems)
Citations: 0
Abstract
In this paper, we focus on estimating the six-degrees-of-freedom (6DoF) pose of a face from a single RGB image, an important but under-investigated problem in 3D face applications such as face reconstruction, forgery detection, and virtual try-on. This problem differs from traditional face pose estimation and 3D face reconstruction because the distance from the camera to the face must also be estimated, which cannot be directly regressed due to the non-linearity of the pose space. To solve the problem, we follow the Perspective-n-Point (PnP) formulation and predict the correspondences between 3D points in canonical space and 2D facial pixels in the input image, from which the 6DoF pose parameters are solved. In this framework, the central problem of 6DoF estimation is building the correspondence matrix between a set of sampled 2D pixels and 3D points, and we propose a Correspondence Learning Transformer (CLT) to achieve this goal. Specifically, we build the 2D and 3D features with local, global, and semantic information, and employ self-attention to let the 2D and 3D features interact and establish the 2D–3D correspondence. In addition, we argue that 6DoF estimation depends not only on the face appearance itself but also on the external context around the face, which contains rich information about the distance to the camera. Therefore, we extract global-and-local features from the integration of face and context: the cropped face image, with a smaller receptive field, concentrates on the small distortions caused by perspective projection, while the whole image, with a large receptive field, provides shoulder and environment information. Experiments show that our method achieves a 2.0% improvement in $MAE_{r}$ and $ADD$ on ARKitFace and a 4.0%/0.7% improvement in $MAE_{t}$ on ARKitFace/BIWI.
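To illustrate why 2D–3D correspondences are enough to recover the distance to the camera, the following is a minimal sketch of a deliberately simplified case, not the method of the paper. It assumes a pinhole camera with a known focal length, a known rotation (identity), and noise-free correspondences, and it solves only for the translation by linear least squares. A full PnP solver also estimates rotation. All function names, point coordinates, and camera values below are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: translation-only pose recovery from 2D-3D
# correspondences under a pinhole camera with known (identity) rotation.
# Names and numbers are hypothetical and do not come from the paper.


def project(point, t, f):
    """Pinhole projection of a canonical 3D point shifted by translation t."""
    x, y, z = (point[i] + t[i] for i in range(3))
    return (f * x / z, f * y / z)


def solve3(m, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / a[r][r]
    return x


def estimate_translation(points3d, pixels, f):
    """Least-squares translation (tx, ty, tz) from 2D-3D correspondences.

    From u = f*(X+tx)/(Z+tz) and v = f*(Y+ty)/(Z+tz), each pair gives
    two equations that are linear in the unknowns:
        f*tx - u*tz = u*Z - f*X
        f*ty - v*tz = v*Z - f*Y
    """
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(points3d, pixels):
        rows.append([f, 0.0, -u])
        rhs.append(u * Z - f * X)
        rows.append([0.0, f, -v])
        rhs.append(v * Z - f * Y)
    # Normal equations: (A^T A) t = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(ata, atb)


# Hypothetical canonical face points in metres (not a real face model).
points = [
    (0.0, 0.0, 0.0),
    (-0.03, 0.03, 0.02),
    (0.03, 0.03, 0.02),
    (-0.07, 0.0, 0.10),
    (0.07, 0.0, 0.10),
    (0.0, -0.06, 0.03),
]
focal = 800.0                 # assumed focal length in pixels
true_t = (0.02, -0.01, 0.50)  # assumed ground-truth translation in metres

pixels = [project(p, true_t, focal) for p in points]
est = estimate_translation(points, pixels, focal)
print("estimated translation:", est)
```

The third component of the estimate is the camera-to-face distance along the optical axis. In this simplified setting it follows directly from where the known 3D points land in the image, which is why the quality of the predicted correspondences matters so much for the translation error.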
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.