FaceChain-MMID: Generating highly identity-consistent realistic portraits via dividing & merging multi-modal representations

Chao Xu, Fei Wang, Cheng Yu, Baigui Sun, Jian Zhao

Pattern Recognition, Volume 168, Article 111858. Published 2025-05-28. DOI: 10.1016/j.patcog.2025.111858
https://www.sciencedirect.com/science/article/pii/S0031320325005187

Abstract:
Recent advancements in text-to-image generation have made significant strides in customizing realistic human photos. Most existing methods focus on addressing efficiency issues to avoid resource-intensive and time-consuming subject-specific fine-tuning. However, they lack an in-depth exploration of identity preservation and thus suffer significant degradation in real scenarios. We propose FaceChain-MMID in response to this challenge. First, we comprehensively represent facial identity using three factors: the face image provides the basic identity, the segmentation mask refines the facial geometry, and the text prompts supplement additional identity-related attributes. Building upon these multi-modal features, we propose a novel dividing and merging strategy to support highly identity-consistent personalized portrait generation. Specifically, the dividing stage ensures that each modality fully expresses its own information by training independent uni-modal conditional diffusion models. The subsequent merging stage introduces an efficient modal-specific proxy module to combine the noise predictions from each branch at the latent denoising steps, which incorporates identity-centric face pairs, a filtering mechanism, and a truncated loss to enhance inter-modal complementarity. Extensive qualitative and quantitative experiments demonstrate the superior performance of our approach in preserving identity.
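The merging stage described in the abstract can be pictured with a small, hypothetical sketch: several uni-modal diffusion branches (face image, segmentation mask, text prompt) each predict noise for the shared latent at a denoising step, and a lightweight proxy fuses those predictions. The class and function names and the softmax-weighted fusion rule below are illustrative assumptions, not the paper's actual modal-specific proxy module.

```python
# Conceptual sketch only: fuse per-modality noise predictions at each denoising step.
import torch
import torch.nn as nn


class ModalProxyFusion(nn.Module):
    """Fuse noise predictions from uni-modal branches with learnable weights (assumed design)."""

    def __init__(self, num_modalities: int):
        super().__init__()
        # One scalar logit per modality (e.g. face image, segmentation mask, text prompt).
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, noise_preds: list[torch.Tensor]) -> torch.Tensor:
        # Softmax keeps the fused prediction a convex combination of the branch outputs.
        weights = torch.softmax(self.logits, dim=0)
        stacked = torch.stack(noise_preds, dim=0)               # (M, B, C, H, W)
        return (weights.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)


def denoise_step(latent, t, branches, fusion, scheduler_step):
    """One latent denoising step: each uni-modal branch predicts noise, the proxy merges them."""
    noise_preds = [branch(latent, t) for branch in branches]   # one prediction per modality
    fused_noise = fusion(noise_preds)                          # (B, C, H, W)
    return scheduler_step(latent, fused_noise, t)              # e.g. a DDPM/DDIM update rule
```

Per the abstract, the actual merging stage is further trained with identity-centric face pairs, a filtering mechanism, and a truncated loss to enhance inter-modal complementarity; the sketch omits those components.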
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.