Part-based Face Recognition with Vision Transformers

BMVC: Proceedings of the British Machine Vision Conference. Pub Date: 2022-11-30. DOI: 10.48550/arXiv.2212.00057

Zhonglin Sun, Georgios Tzimiropoulos


Citations: 2

Abstract

Holistic methods using CNNs and margin-based losses have dominated research on face recognition. In this work, we depart from this setting in two ways: (a) we employ the Vision Transformer as an architecture for training a very strong baseline for face recognition, simply called fViT, which already surpasses most state-of-the-art face recognition methods; (b) we capitalize on the Transformer's inherent ability to process information (visual tokens) extracted from irregular grids to devise a pipeline for face recognition that is reminiscent of part-based face recognition methods. Our pipeline, called part fViT, comprises a lightweight network that predicts the coordinates of facial landmarks, followed by the Vision Transformer operating on patches extracted at the predicted landmarks. It is trained end-to-end with no landmark supervision. By learning to extract discriminative patches, our part-based Transformer further boosts the accuracy of our Vision Transformer baseline, achieving state-of-the-art accuracy on several face recognition benchmarks.
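The patch-extraction step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how patches are sampled, so bilinear interpolation is assumed here as one common way to read pixel values at real-valued landmark coordinates while keeping the operation differentiable with respect to those coordinates. The landmark-predicting network and the Transformer itself are omitted, and the names `bilinear`, `extract_patch`, and `patches_to_tokens` are hypothetical.

```python
import math


def bilinear(image, y, x):
    """Sample a 2D image (list of rows) at real-valued (y, x), clamped to the border."""
    h, w = len(image), len(image[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * image[y0][x0] + dx * image[y0][x1]
    bottom = (1 - dx) * image[y1][x0] + dx * image[y1][x1]
    return (1 - dy) * top + dy * bottom


def extract_patch(image, center, size):
    """Return a size x size patch centred on a real-valued (y, x) landmark."""
    cy, cx = center
    offset = (size - 1) / 2.0
    return [
        [bilinear(image, cy - offset + i, cx - offset + j) for j in range(size)]
        for i in range(size)
    ]


def patches_to_tokens(image, landmarks, size):
    """Flatten one patch per landmark into a token vector (assumed Transformer input)."""
    tokens = []
    for landmark in landmarks:
        patch = extract_patch(image, landmark, size)
        tokens.append([value for row in patch for value in row])
    return tokens


# Toy 8x8 image whose pixel value encodes its position: value = 10 * row + col.
image = [[10 * r + c for c in range(8)] for r in range(8)]
# Hypothetical landmark positions; in the paper these come from a learned network.
landmarks = [(2.5, 3.5), (5.0, 4.0)]
tokens = patches_to_tokens(image, landmarks, size=2)
```

Because the patch centres are arbitrary real-valued points rather than cells of a fixed grid, the resulting tokens form the kind of irregular-grid input the abstract refers to. In a trainable version, gradients would flow through the interpolation weights back to the predicted coordinates, which is what would make training without landmark supervision possible under this assumption.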

