Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose

IF 11.6 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2023-11-06 DOI:10.1007/s11263-023-01935-2

Yaokun Li, Guang Tan, Chao Gou

{"title":"Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose","authors":"Yaokun Li, Guang Tan, Chao Gou","doi":"10.1007/s11263-023-01935-2","DOIUrl":null,"url":null,"abstract":"<p>Landmark detection under large pose with occlusion has been one of the challenging problems in the field of facial analysis. Recently, many works have predicted pose or occlusion together in the multi-task learning (MTL) paradigm, trying to tap into their dependencies and thus alleviate this issue. However, such implicit dependencies are weakly interpretable and inconsistent with the way humans exploit inter-task coupling relations, i.e., accommodating the induced explicit effects. This is one of the essentials that hinders their performance. To this end, in this paper, we propose a Cascaded Iterative Transformer (CIT) to jointly predict facial landmark, occlusion probability, and pose. The proposed CIT, besides implicitly mining task dependencies in a shared encoder, innovatively employs a cost-effective and portability-friendly strategy to pass the decoders’ predictions as prior knowledge to human-like exploit the coupling-induced effects. Moreover, to the best of our knowledge, no dataset contains all these task annotations simultaneously, so we introduce a new dataset termed MERL-RAV-FLOP based on the MERL-RAV dataset. We conduct extensive experiments on several challenging datasets (300W-LP, AFLW2000-3D, BIWI, COFW, and MERL-RAV-FLOP) and achieve remarkable results. The code and dataset can be accessed in https://github.com/Iron-LYK/CIT.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"57 16","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-023-01935-2","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Landmark detection under large pose with occlusion has been one of the challenging problems in the field of facial analysis. Recently, many works have predicted pose or occlusion together in the multi-task learning (MTL) paradigm, trying to tap into their dependencies and thus alleviate this issue. However, such implicit dependencies are weakly interpretable and inconsistent with the way humans exploit inter-task coupling relations, i.e., accommodating the induced explicit effects. This is one of the essentials that hinders their performance. To this end, in this paper, we propose a Cascaded Iterative Transformer (CIT) to jointly predict facial landmark, occlusion probability, and pose. The proposed CIT, besides implicitly mining task dependencies in a shared encoder, innovatively employs a cost-effective and portability-friendly strategy to pass the decoders’ predictions as prior knowledge to human-like exploit the coupling-induced effects. Moreover, to the best of our knowledge, no dataset contains all these task annotations simultaneously, so we introduce a new dataset termed MERL-RAV-FLOP based on the MERL-RAV dataset. We conduct extensive experiments on several challenging datasets (300W-LP, AFLW2000-3D, BIWI, COFW, and MERL-RAV-FLOP) and achieve remarkable results. The code and dataset can be accessed in https://github.com/Iron-LYK/CIT.

Abstract Image

查看原文本刊更多论文

用于联合预测面部标志、遮挡概率和头部姿势的级联迭代变换器

遮挡大姿态下的地标检测一直是人脸分析领域中具有挑战性的问题之一。最近，许多工作在多任务学习（MTL）范式中一起预测了姿势或遮挡，试图挖掘它们的相关性，从而缓解这一问题。然而，这种隐式依赖关系的可解释性很弱，与人类利用任务间耦合关系的方式不一致，即适应诱导的显式效应。这是阻碍他们表现的要素之一。为此，在本文中，我们提出了一种级联迭代变换器（CIT）来联合预测面部地标、遮挡概率和姿势。所提出的CIT除了在共享编码器中隐式挖掘任务依赖性外，还创新性地采用了一种成本效益高、可移植性好的策略，将解码器的预测作为先验知识传递给类人利用耦合诱导效应。此外，据我们所知，没有一个数据集同时包含所有这些任务注释，因此我们在MERL-RAV数据集的基础上引入了一个新的数据集，称为MERL-RAV-FLOP。我们在几个具有挑战性的数据集（300W-LP、AFLW2000-3D、BIWI、COFW和MERL-RAV-FLOP）上进行了广泛的实验，并取得了显著的结果。可以在中访问代码和数据集https://github.com/Iron-LYK/CIT.

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.