TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security

IF 18.6 | Zone 1 (Computer Science) | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence | Pub Date: 2025-09-30 | DOI: 10.1109/tpami.2025.3616149

Jun Dan, Yang Liu, Baigui Sun, Jiankang Deng, Shan Luo


Citations: 0

Abstract


TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security.

Face Recognition (FR) technology has made significant strides with the emergence of deep learning. Most existing FR models are built upon Convolutional Neural Networks (CNNs) and take RGB face images as input. In this work, we take a closer look at existing FR paradigms from the perspectives of efficiency, security, and precision, and identify three problems: (i) CNN frameworks are weak at capturing global facial features and at modeling the correlations between local facial features. (ii) Using RGB face images as the model's input greatly degrades inference efficiency and adds computation cost. (iii) In a real-world FR system that operates on RGB face images, user privacy may be compromised if hackers penetrate the system and gain access to the model's input. To solve these three issues, we propose two novel FR frameworks, TransFace and TransFace++, which explore the feasibility of applying ViTs and image bytes to FR tasks, respectively.

First, we observe that ViTs prove fragile when applied to FR scenarios with extremely large datasets. We investigate the reasons for this phenomenon and find that existing data augmentation approaches and hard sample mining strategies are incompatible with ViT-based FR backbones, because they lack tailored consideration for preserving face structural information and for leveraging the information in each local token. To remedy these problems, we first propose a superior FR model called TransFace, which contains a patch-level data augmentation strategy named Dominant Patch Amplitude Perturbation (DPAP) and a hard sample mining strategy named Entropy-guided Hard Sample Mining (EHSM).

Furthermore, to improve inference efficiency and user privacy protection, we investigate the intrinsic properties of image bytes and propose a superior FR model termed TransFace++. The proposed model is trained directly on image bytes, presenting a novel approach to the aforementioned issues. Specifically, considering the importance of local correlations in bytes, an image bytes compression strategy named Topology-based Image Bytes Compression (TIBC) is introduced to extract prominent features from the raw bytes and integrate these features with byte embeddings, effectively mitigating information loss during the bytes mapping process. Moreover, to strengthen the model's perception of geometric information encoded in image bytes, a novel cross-attention module named Structure Information-guided Cross-Attention (SICA) is designed to inject structure information into byte tokens for information interaction, significantly improving the model's generalization ability. Experiments on popular face benchmarks demonstrate the superiority of TransFace and TransFace++. Code is available at https://github.com/DanJun6737/TransFace_pp.
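To make "trained directly on image bytes" and "byte embeddings" more concrete, the following is a minimal sketch of how a raw byte string could be turned into a token sequence and looked up in an embedding table. It is not the authors' implementation: the abstract does not say which file encoding is used, how sequences are padded, or how embeddings are initialised, so `PAD_ID`, `bytes_to_tokens`, `make_embedding_table`, and `embed` are hypothetical names chosen for illustration, and TIBC and SICA are not modelled at all.

```python
import random
from typing import List

VOCAB_SIZE = 256          # one id per possible byte value
PAD_ID = VOCAB_SIZE       # hypothetical extra id used only for padding (assumption)


def bytes_to_tokens(raw: bytes, max_len: int) -> List[int]:
    """Turn raw file bytes into a fixed-length list of integer token ids."""
    tokens = list(raw[:max_len])                    # each byte becomes an int in 0..255
    tokens += [PAD_ID] * (max_len - len(tokens))    # right-pad short inputs
    return tokens


def make_embedding_table(dim: int, seed: int = 0) -> List[List[float]]:
    """Build a randomly initialised lookup table with one row per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(VOCAB_SIZE + 1)]


def embed(tokens: List[int], table: List[List[float]]) -> List[List[float]]:
    """Look up one embedding vector per byte token."""
    return [table[t] for t in tokens]


# Stand-in for the contents of an encoded face image file (not real image data).
raw = bytes([255, 216, 255, 224, 0, 16])
tokens = bytes_to_tokens(raw, max_len=8)
vectors = embed(tokens, make_embedding_table(dim=4))
print(tokens)                          # [255, 216, 255, 224, 0, 16, 256, 256]
print(len(vectors), len(vectors[0]))   # 8 4
```

In this sketch the model input is the byte sequence of the stored file and no RGB array is built from it, which is the property the abstract ties to inference efficiency and to privacy protection. A real system would feed the embedded sequence to a Transformer backbone.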


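Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 28.40

Self-citation rate: 3.00%

Articles published: 885

Review time: 8.5 months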

About the journal: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. It also covers areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures.