Learning relationship-guided vision-language transformer for facial attribute recognition

Si Chen, Mingxuan Lei, Da-Han Wang, Xu-Yao Zhang, Yan Yan, Shunzhi Zhu

Pattern Recognition, Volume 170, Article 112063. DOI: 10.1016/j.patcog.2025.112063. Published 2025-07-07.
Facial Attribute Recognition (FAR) remains a challenging task due to the poor quality of facial images. Existing FAR methods generally adopt a Convolutional Neural Network (CNN) to learn deep features of facial images, and obtain attribute relationships either through a fixed clustering algorithm or by manual grouping. In this paper, we propose a novel Relationship-Guided Vision-Language Transformer, termed RVLT, for FAR. RVLT automatically learns attribute relationships from the linguistic modality and uses them to guide image feature extraction. Specifically, each vision-language Transformer encoder adopts an Image-Text Cross-Attention (ITCA) module, composed of an Image-to-Text Adjustment Attention (ITAA) and a Text-to-Image Guidance Attention (TIGA). The ITAA adjusts the text tokens from the linguistic modality to adapt to the visual information, while the TIGA employs the adjusted text tokens as prior knowledge to guide the distribution of the image embeddings from the visual modality. Moreover, we employ a Token Selection Mechanism (TSM) to suppress interference from the image background, so that the model attends more closely to face-related regions. In addition, an Image-Text Alignment (ITA) loss further aligns the tokens of the visual and linguistic modalities, and a Text-Aware Classification (TAC) loss helps ensure, as far as possible, the correctness of the attribute relationships learned from the raw text. Unlike methods that directly use the image embeddings for feature extraction, our method leverages the attribute relationships automatically learned from the raw text to effectively highlight image features relevant to facial attributes. Experimental results demonstrate that the proposed RVLT achieves 87.34% and 91.80% accuracy on LFWA and CelebA, respectively. With limited labeled data, RVLT outperforms the second-best FAR method by 1.76% and 0.25% in accuracy using only 5% of the LFWA and 0.5% of the CelebA training data, respectively.
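To make the dual cross-attention concrete, the following is a minimal PyTorch sketch of how an ITCA block and the token selection step could be wired. The paper's code is not reproduced here, so the module names, dimensions, residual wiring, and the use of standard multi-head attention are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ImageTextCrossAttention(nn.Module):
    """Hypothetical ITCA block: ITAA adapts the text tokens to the image,
    then TIGA uses the adjusted text tokens to guide the image tokens."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.itaa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.tiga = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # ITAA: text queries attend to image keys/values, so the attribute
        # (text) tokens are adjusted toward the visual content.
        adj, itaa_attn = self.itaa(txt_tokens, img_tokens, img_tokens)
        txt_tokens = self.norm_txt(txt_tokens + adj)
        # TIGA: image queries attend to the adjusted text keys/values, so
        # the text-derived attribute knowledge steers the image embeddings.
        guided, _ = self.tiga(img_tokens, txt_tokens, txt_tokens)
        img_tokens = self.norm_img(img_tokens + guided)
        return img_tokens, txt_tokens, itaa_attn


def select_tokens(img_tokens, itaa_attn, keep_ratio=0.5):
    """Hypothetical TSM: keep the image tokens that receive the most
    attention from the text queries, dropping likely-background tokens."""
    # itaa_attn: (batch, num_txt_tokens, num_img_tokens) -> score each
    # image token by the mean attention it receives across text queries.
    scores = itaa_attn.mean(dim=1)
    k = max(1, int(keep_ratio * img_tokens.size(1)))
    idx = scores.topk(k, dim=1).indices                        # (batch, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, img_tokens.size(-1))
    return img_tokens.gather(1, idx)                           # (batch, k, dim)


if __name__ == "__main__":
    block = ImageTextCrossAttention()
    img = torch.randn(2, 196, 512)  # e.g. 14x14 ViT patch tokens
    txt = torch.randn(2, 40, 512)   # e.g. one token per facial attribute
    img, txt, attn = block(img, txt)
    print(select_tokens(img, attn).shape)  # torch.Size([2, 98, 512])
```

The ordering mirrors the abstract: text is first adjusted toward the image (ITAA), and only the adjusted text is used as guidance (TIGA), rather than attending with the raw text directly.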
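The two training objectives can be sketched in the same spirit. The abstract does not give their formulas, so the versions below are one plausible reading rather than the paper's actual losses: ITA rendered as a CLIP-style symmetric contrastive alignment of pooled image/text embeddings, and TAC as per-attribute binary classification computed from the text tokens themselves; the `nn.Linear` head is a hypothetical stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def ita_loss(img_emb, txt_emb, temperature=0.07):
    """Assumed ITA loss: symmetric InfoNCE over pooled image and text
    embeddings of shape (batch, dim); matched pairs are pulled together."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature             # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def tac_loss(txt_tokens, classifier, labels):
    """Assumed TAC loss: predict each binary facial attribute from its
    (adjusted) text token, keeping the relationships learned from raw
    text consistent with the ground-truth attribute labels."""
    # txt_tokens: (batch, num_attrs, dim); labels: (batch, num_attrs).
    logits = classifier(txt_tokens).squeeze(-1)              # (batch, num_attrs)
    return F.binary_cross_entropy_with_logits(logits, labels.float())


if __name__ == "__main__":
    clf = nn.Linear(512, 1)  # shared per-attribute head (assumed)
    loss = (ita_loss(torch.randn(8, 512), torch.randn(8, 512))
            + tac_loss(torch.randn(8, 40, 512), clf,
                       torch.randint(0, 2, (8, 40))))
    print(loss.item())
```

In practice the two terms would be weighted and summed with the main attribute classification loss; the abstract does not state the weights, so none are assumed here.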
Journal Introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.