Wenfeng Song , Zhongyong Ye , Meng Sun , Xia Hou , Shuai Li , Aimin Hao
{"title":"AttriDiffuser: Adversarially enhanced diffusion model for text-to-facial attribute image synthesis","authors":"Wenfeng Song , Zhongyong Ye , Meng Sun , Xia Hou , Shuai Li , Aimin Hao","doi":"10.1016/j.patcog.2025.111447","DOIUrl":null,"url":null,"abstract":"<div><div>In the progressive domain of computer vision, generating high-fidelity facial images from textual descriptions with precision remains a complex challenge. While existing diffusion models have demonstrated capabilities in text-to-image synthesis, they often struggle with capturing intricate details from complex, multi-attribute textual descriptions, leading to entity or attribute loss and inaccurate combinations. We propose AttriDiffuser, a novel model designed to ensure that each entity and attribute in textual descriptions is distinctly and accurately represented in the synthesized images. AttriDiffuser utilizes a text-driven attribute diffusion adversarial model, enhancing the correspondence between textual attributes and image features. It incorporates an attribute-gating cross-attention mechanism seamlessly into the adversarial learning enhanced diffusion model. AttriDiffuser advances traditional diffusion models by integrating a face diversity discriminator, which augments adversarial training and promotes the generation of diverse yet precise facial images in alignment with complex textual descriptions. Our empirical evaluation, conducted on the renowned Multimodal VoxCeleb and CelebA-HQ datasets, and benchmarked against other state-of-the-art models, demonstrates AttriDiffuser’s superior efficacy. The results indicate its unparalleled capability to synthesize high-quality facial images with rigorous adherence to complex, multi-faceted textual descriptions, marking a significant advancement in text-to-facial attribute synthesis. Our code and model will be made publicly available at <span><span>https://github.com/sunmeng7/AttriDiffuser</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111447"},"PeriodicalIF":7.5000,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325001074","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
In the progressive domain of computer vision, generating high-fidelity facial images from textual descriptions with precision remains a complex challenge. While existing diffusion models have demonstrated capabilities in text-to-image synthesis, they often struggle with capturing intricate details from complex, multi-attribute textual descriptions, leading to entity or attribute loss and inaccurate combinations. We propose AttriDiffuser, a novel model designed to ensure that each entity and attribute in textual descriptions is distinctly and accurately represented in the synthesized images. AttriDiffuser utilizes a text-driven attribute diffusion adversarial model, enhancing the correspondence between textual attributes and image features. It incorporates an attribute-gating cross-attention mechanism seamlessly into the adversarial learning enhanced diffusion model. AttriDiffuser advances traditional diffusion models by integrating a face diversity discriminator, which augments adversarial training and promotes the generation of diverse yet precise facial images in alignment with complex textual descriptions. Our empirical evaluation, conducted on the renowned Multimodal VoxCeleb and CelebA-HQ datasets, and benchmarked against other state-of-the-art models, demonstrates AttriDiffuser’s superior efficacy. The results indicate its unparalleled capability to synthesize high-quality facial images with rigorous adherence to complex, multi-faceted textual descriptions, marking a significant advancement in text-to-facial attribute synthesis. Our code and model will be made publicly available at https://github.com/sunmeng7/AttriDiffuser.
期刊介绍:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.