CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization
Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao
arXiv:2409.05606 (arXiv - CS - Multimedia, 2024-09-09)
Abstract
Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. The task enables pre-trained models to generate novel images of a unique subject. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which misconstrues that image's irrelevant attributes (e.g., view, pose, and background) as the subject's intrinsic attributes. This misconstruction leads to simultaneous overfitting of irrelevant attributes and underfitting of intrinsic attributes, i.e., the former are over-represented while the latter are under-represented, causing a trade-off between subject similarity and text controllability. In this study, we argue that an ideal subject representation can be achieved from a cross-differential perspective, i.e., by decoupling the subject's intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus on intrinsic attributes through intra-consistency (features of the same subject lie closer in feature space) and inter-distinctiveness (features of different subjects are clearly separated). Specifically, we propose CustomContrast, a novel framework comprising a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) encoder. The MCL paradigm extracts intrinsic subject features from high-level semantics down to low-level appearance through cross-modal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, the MFI encoder captures cross-modal representations. Extensive experiments demonstrate the effectiveness of CustomContrast in both subject similarity and text controllability.
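To make the contrastive idea concrete, below is a minimal, illustrative sketch of an InfoNCE-style loss that encourages intra-consistency (embeddings of the same subject under different views, poses, or backgrounds are pulled together) and inter-distinctiveness (embeddings of different subjects are pushed apart). This is not the paper's implementation; the function name, temperature value, and batch-pairing scheme are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' code): an InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

def subject_contrastive_loss(anchor: torch.Tensor,
                             positive: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (B, D) embeddings of the same B subjects captured
    under different irrelevant attributes (e.g., view, pose, background)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Diagonal entries are same-subject pairs (positives, pulled together);
    # off-diagonal entries are different-subject pairs (negatives, pushed apart).
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Random features stand in for encoder outputs in this sketch.
    anchor = torch.randn(8, 256)
    positive = torch.randn(8, 256)
    print(subject_contrastive_loss(anchor, positive).item())
```

In the paper's framing, such objectives are applied at multiple levels (cross-modal semantics and multiscale appearance) rather than on a single embedding as in this simplified example.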