CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization

Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao
{"title":"CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization","authors":"Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao","doi":"arxiv-2409.05606","DOIUrl":null,"url":null,"abstract":"Subject-driven text-to-image (T2I) customization has drawn significant\ninterest in academia and industry. This task enables pre-trained models to\ngenerate novel images based on unique subjects. Existing studies adopt a\nself-reconstructive perspective, focusing on capturing all details of a single\nimage, which will misconstrue the specific image's irrelevant attributes (e.g.,\nview, pose, and background) as the subject intrinsic attributes. This\nmisconstruction leads to both overfitting or underfitting of irrelevant and\nintrinsic attributes of the subject, i.e., these attributes are\nover-represented or under-represented simultaneously, causing a trade-off\nbetween similarity and controllability. In this study, we argue an ideal\nsubject representation can be achieved by a cross-differential perspective,\ni.e., decoupling subject intrinsic attributes from irrelevant attributes via\ncontrastive learning, which allows the model to focus more on intrinsic\nattributes through intra-consistency (features of the same subject are\nspatially closer) and inter-distinctiveness (features of different subjects\nhave distinguished differences). Specifically, we propose CustomContrast, a\nnovel framework, which includes a Multilevel Contrastive Learning (MCL)\nparadigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is\nused to extract intrinsic features of subjects from high-level semantics to\nlow-level appearance through crossmodal semantic contrastive learning and\nmultiscale appearance contrastive learning. To facilitate contrastive learning,\nwe introduce the MFI encoder to capture cross-modal representations. Extensive\nexperiments show the effectiveness of CustomContrast in subject similarity and\ntext controllability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05606","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images of a given unique subject. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which misconstrues the specific image's irrelevant attributes (e.g., view, pose, and background) as the subject's intrinsic attributes. This misconstruction leads to simultaneous overfitting of irrelevant attributes and underfitting of intrinsic attributes, i.e., irrelevant attributes are over-represented while intrinsic attributes are under-represented, causing a trade-off between subject similarity and text controllability. In this study, we argue that an ideal subject representation can be achieved from a cross-differential perspective, i.e., by decoupling subject intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects are clearly distinguished). Specifically, we propose CustomContrast, a novel framework comprising a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) encoder. The MCL paradigm extracts intrinsic features of subjects from high-level semantics down to low-level appearance through cross-modal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI encoder to capture cross-modal representations. Extensive experiments demonstrate the effectiveness of CustomContrast in subject similarity and text controllability.
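The abstract does not spell out the loss formulation, but the intra-consistency and inter-distinctiveness objectives it describes match the standard InfoNCE family of contrastive losses. The following is a minimal, hypothetical PyTorch sketch of such an objective, not the paper's actual implementation: features of the same subject under different views serve as positive pairs, while features of other subjects serve as negatives. The function name, tensor shapes, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def subject_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    # Hypothetical sketch of an InfoNCE-style contrastive objective.
    # anchor:    (B, D)    subject features from one image/view
    # positive:  (B, D)    features of the same subject from another view
    # negatives: (B, K, D) features of K different subjects
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Intra-consistency: pull features of the same subject together.
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    # Inter-distinctiveness: push features of different subjects apart.
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)

    # InfoNCE: classify the positive (index 0) against the K negatives.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

Under the MCL paradigm described above, one could imagine such a loss being applied at two levels: over high-level cross-modal semantic features and over multiscale low-level appearance features. The abstract does not specify the exact feature sources, so this pairing is only a reading of the stated design.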