Learning Multi-Scale Knowledge-Guided Features for Text-Guided Face Recognition

Impact factor: 5.0

IEEE Transactions on Biometrics, Behavior, and Identity Science · Pub Date: 2024-09-23 · DOI: 10.1109/TBIOM.2024.3466216

Md Mahedi Hasan;Shoaib Meraj Sami;Nasser M. Nasrabadi;Jeremy Dawson

Volume 7, Issue 2, pp. 195-209 · Journal Article · https://ieeexplore.ieee.org/document/10689338/

Citations: 0

Abstract

Text-guided face recognition (TGFR) aims to improve the performance of state-of-the-art face recognition (FR) algorithms by incorporating auxiliary information, such as distinct facial marks and attributes, provided as natural language descriptions. Current TGFR algorithms have been proven to be highly effective in addressing performance drops in state-of-the-art FR models, particularly in scenarios involving sensor noise, low resolution, and turbulence effects. Although existing methods explore various algorithms using different cross-modal alignment and fusion techniques, they encounter practical limitations in real-world applications. For example, during inference, textual descriptions associated with face images may be missing, lacking crucial details, or incorrect. Furthermore, the presence of inherent modality heterogeneity poses a significant challenge in achieving effective cross-modal alignment. To address these challenges, we introduce CaptionFace, a TGFR framework that integrates GPTFace, a face image captioning model designed to generate context-rich natural language descriptions from low-resolution facial images. By leveraging GPTFace, we overcome the issue of missing textual descriptions, expanding the applicability of CaptionFace to single-modal FR datasets. Additionally, we introduce a multi-scale feature alignment (MSFA) module to ensure semantic alignment between face-caption pairs at different granularities. Furthermore, we introduce an attribute-aware loss and perform knowledge adaptation to specifically adapt textual knowledge from facial features. Extensive experiments on three face-caption datasets and various unconstrained single-modal benchmark datasets demonstrate that CaptionFace significantly outperforms state-of-the-art FR models and existing TGFR approaches.
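The abstract does not say how the MSFA module computes alignment, so the following is only a minimal sketch of one common way to align image-caption pairs at two granularities, not the authors' implementation. It assumes a symmetric InfoNCE contrastive loss, a global score from whole-image and whole-caption embeddings, and a local score in which each word embedding is matched to its most similar image region. All function names, dictionary keys, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE over a square similarity matrix.

    Matched image-caption pairs are assumed to sit on the diagonal.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # cross-entropy with target index i
        return total / n

    cols = [list(c) for c in zip(*sim)]  # transpose: caption-to-image direction
    return 0.5 * (one_direction(sim) + one_direction(cols))


def local_similarity(regions, words):
    """Fine-grained score: each word picks its best-matching region, then average."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)


def multi_scale_alignment_loss(batch, temperature=0.1):
    """Sum of a global and a local contrastive loss (assumed combination).

    batch: list of dicts with hypothetical keys
    'img_global', 'txt_global', 'img_regions', 'txt_words'.
    """
    global_sim = [[cosine(a["img_global"], b["txt_global"]) for b in batch]
                  for a in batch]
    local_sim = [[local_similarity(a["img_regions"], b["txt_words"]) for b in batch]
                 for a in batch]
    return info_nce(global_sim, temperature) + info_nce(local_sim, temperature)
```

In this sketch a batch whose captions match their images yields a lower loss than the same batch with captions swapped, which is the behavior a cross-modal alignment objective is meant to encourage.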


Source journal

IEEE transactions on biometrics, behavior, and identity science

CiteScore

10.90

Self-citation rate

0.00%

Publication count