Learning relationship-guided vision-language transformer for facial attribute recognition

Si Chen, Mingxuan Lei, Da-Han Wang, Xu-Yao Zhang, Yan Yan, Shunzhi Zhu

Pattern Recognition, Volume 170, Article 112063. DOI: 10.1016/j.patcog.2025.112063. Published 2025-07-07.
Facial Attribute Recognition (FAR) remains a challenging task due to the poor quality of facial images. Existing FAR methods generally adopt a Convolutional Neural Network (CNN) to learn deep features of facial images, and obtain attribute relationships either through a fixed clustering algorithm or by manual grouping. In this paper, we propose a novel Relationship-Guided Vision-Language Transformer, termed RVLT, for FAR. RVLT automatically learns attribute relationships from the linguistic modality and uses them to guide image feature extraction. Specifically, each vision-language Transformer encoder adopts an Image-Text Cross-Attention (ITCA) module, composed of an Image-to-Text Adjustment Attention (ITAA) and a Text-to-Image Guidance Attention (TIGA). The ITAA adjusts the text tokens from the linguistic modality to adapt to the visual information, while the TIGA employs the adjusted text tokens as prior knowledge to guide the distribution of the image embeddings from the visual modality. Moreover, we employ a Token Selection Mechanism (TSM) to suppress interference from the image background, so that the model attends more closely to face-related regions. In addition, an Image-Text Alignment (ITA) loss further aligns the tokens of the visual and linguistic modalities, and a Text-Aware Classification (TAC) loss helps ensure, as far as possible, the correctness of the attribute relationships learned from the raw text. Unlike methods that directly use the image embeddings for feature extraction, our method leverages the attribute relationships automatically learned from the raw text to effectively highlight image features relevant to facial attributes. Experimental results demonstrate that the proposed RVLT achieves 87.34% and 91.80% accuracy on LFWA and CelebA, respectively. With limited labeled data, RVLT outperforms the second-best FAR method by 1.76% and 0.25% in accuracy using only 5% of the LFWA and 0.5% of the CelebA training data, respectively.
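To make the dual cross-attention concrete, the following is a minimal PyTorch sketch of how an ITCA block and the token selection step could be wired. The paper's code is not reproduced here, so the module names, dimensions, residual wiring, and the use of standard multi-head attention are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ImageTextCrossAttention(nn.Module):
    """Hypothetical ITCA block: ITAA adapts the text tokens to the image,
    then TIGA uses the adjusted text tokens to guide the image tokens."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.itaa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.tiga = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # ITAA: text queries attend to image keys/values, so the attribute
        # (text) tokens are adjusted toward the visual content.
        adj, itaa_attn = self.itaa(txt_tokens, img_tokens, img_tokens)
        txt_tokens = self.norm_txt(txt_tokens + adj)
        # TIGA: image queries attend to the adjusted text keys/values, so
        # the text-derived attribute knowledge steers the image embeddings.
        guided, _ = self.tiga(img_tokens, txt_tokens, txt_tokens)
        img_tokens = self.norm_img(img_tokens + guided)
        return img_tokens, txt_tokens, itaa_attn


def select_tokens(img_tokens, itaa_attn, keep_ratio=0.5):
    """Hypothetical TSM: keep the image tokens that receive the most
    attention from the text queries, dropping likely-background tokens."""
    # itaa_attn: (batch, num_txt_tokens, num_img_tokens) -> score each
    # image token by the mean attention it receives across text queries.
    scores = itaa_attn.mean(dim=1)
    k = max(1, int(keep_ratio * img_tokens.size(1)))
    idx = scores.topk(k, dim=1).indices                        # (batch, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, img_tokens.size(-1))
    return img_tokens.gather(1, idx)                           # (batch, k, dim)


if __name__ == "__main__":
    block = ImageTextCrossAttention()
    img = torch.randn(2, 196, 512)  # e.g. 14x14 ViT patch tokens
    txt = torch.randn(2, 40, 512)   # e.g. one token per facial attribute
    img, txt, attn = block(img, txt)
    print(select_tokens(img, attn).shape)  # torch.Size([2, 98, 512])
```

The ordering mirrors the abstract: text is first adjusted toward the image (ITAA), and only the adjusted text is used as guidance (TIGA), rather than attending with the raw text directly.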
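The two training objectives can be sketched in the same spirit. The abstract does not give their formulas, so the versions below are one plausible reading rather than the paper's actual losses: ITA rendered as a CLIP-style symmetric contrastive alignment of pooled image/text embeddings, and TAC as per-attribute binary classification computed from the text tokens themselves; the `nn.Linear` head is a hypothetical stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def ita_loss(img_emb, txt_emb, temperature=0.07):
    """Assumed ITA loss: symmetric InfoNCE over pooled image and text
    embeddings of shape (batch, dim); matched pairs are pulled together."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature             # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def tac_loss(txt_tokens, classifier, labels):
    """Assumed TAC loss: predict each binary facial attribute from its
    (adjusted) text token, keeping the relationships learned from raw
    text consistent with the ground-truth attribute labels."""
    # txt_tokens: (batch, num_attrs, dim); labels: (batch, num_attrs).
    logits = classifier(txt_tokens).squeeze(-1)              # (batch, num_attrs)
    return F.binary_cross_entropy_with_logits(logits, labels.float())


if __name__ == "__main__":
    clf = nn.Linear(512, 1)  # shared per-attribute head (assumed)
    loss = (ita_loss(torch.randn(8, 512), torch.randn(8, 512))
            + tac_loss(torch.randn(8, 40, 512), clf,
                       torch.randint(0, 2, (8, 40))))
    print(loss.item())
```

In practice the two terms would be weighted and summed with the main attribute classification loss; the abstract does not state the weights, so none are assumed here.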
Journal Introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.