FaceChain-MMID: Generating highly identity-consistent realistic portraits via dividing & merging multi-modal representations

Chao Xu, Fei Wang, Cheng Yu, Baigui Sun, Jian Zhao

Pattern Recognition, Volume 168, Article 111858. Published 2025-05-28. DOI: 10.1016/j.patcog.2025.111858
https://www.sciencedirect.com/science/article/pii/S0031320325005187

Abstract:
Recent advancements in text-to-image generation have made significant strides in customizing realistic human photos. Most existing methods focus on addressing efficiency issues to avoid resource-intensive and time-consuming subject-specific fine-tuning. However, they lack an in-depth exploration of identity preservation and thus suffer significant degradation in real scenarios. We propose FaceChain-MMID in response to this challenge. First, we comprehensively represent facial identity using three factors: the face image provides the basic identity, the segmentation mask refines the facial geometry, and the text prompts supplement additional identity-related attributes. Building upon these multi-modal features, we propose a novel dividing and merging strategy to support highly identity-consistent personalized portrait generation. Specifically, the dividing stage ensures that each modality fully expresses its own information by training independent uni-modal conditional diffusion models. The subsequent merging stage introduces an efficient modal-specific proxy module to combine the noise predictions from each branch at the latent denoising steps, which incorporates identity-centric face pairs, a filtering mechanism, and a truncated loss to enhance inter-modal complementarity. Extensive qualitative and quantitative experiments demonstrate the superior performance of our approach in preserving identity.
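The merging stage described in the abstract can be pictured with a small, hypothetical sketch: several uni-modal diffusion branches (face image, segmentation mask, text prompt) each predict noise for the shared latent at a denoising step, and a lightweight proxy fuses those predictions. The class and function names and the softmax-weighted fusion rule below are illustrative assumptions, not the paper's actual modal-specific proxy module.

```python
# Conceptual sketch only: fuse per-modality noise predictions at each denoising step.
import torch
import torch.nn as nn


class ModalProxyFusion(nn.Module):
    """Fuse noise predictions from uni-modal branches with learnable weights (assumed design)."""

    def __init__(self, num_modalities: int):
        super().__init__()
        # One scalar logit per modality (e.g. face image, segmentation mask, text prompt).
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, noise_preds: list[torch.Tensor]) -> torch.Tensor:
        # Softmax keeps the fused prediction a convex combination of the branch outputs.
        weights = torch.softmax(self.logits, dim=0)
        stacked = torch.stack(noise_preds, dim=0)               # (M, B, C, H, W)
        return (weights.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)


def denoise_step(latent, t, branches, fusion, scheduler_step):
    """One latent denoising step: each uni-modal branch predicts noise, the proxy merges them."""
    noise_preds = [branch(latent, t) for branch in branches]   # one prediction per modality
    fused_noise = fusion(noise_preds)                          # (B, C, H, W)
    return scheduler_step(latent, fused_noise, t)              # e.g. a DDPM/DDIM update rule
```

Per the abstract, the actual merging stage is further trained with identity-centric face pairs, a filtering mechanism, and a truncated loss to enhance inter-modal complementarity; the sketch omits those components.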
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.