{"title":"Multi-View User Preference Modeling for Personalized Text-to-Image Generation","authors":"Huaiwen Zhang;Tianci Wu;Yinwei Wei","doi":"10.1109/TMM.2025.3557683","DOIUrl":null,"url":null,"abstract":"Personalized text-to-image generation aims to synthesize images tailored to individual user preferences. Existing methods primarily generate customized content using a few reference images, which often struggle to mine user preferences from historical records, and thus fail to synthesize truly personalized content. In addition, it is difficult to directly incorporate the extracted feature of user preferences into the feature space of the generation model, since there exists a considerable gap between them. In this paper, we propose a novel multi-view personalized text-to-image generation method based on the diffusion model, named MVP-Diffusion, which learns instance- and user-level preferences from historical records and integrates them into the generation model. For instance-level user preference modeling, we employ a chain-of-thought prompting strategy to deduce preference keywords and integrate them into input prompts with the aid of a large language model. For user-level preference modeling, we construct a learnable embedding for each user to capture more comprehensive preferences by analyzing their historical records. An adaptive user preference fusion module is proposed to inject user preferences into the generation model via a set of learnable parameters. Experimental results demonstrate that the proposed method significantly enhances the personalization of the generated images compared to the other personalized text-to-image generation methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3082-3091"},"PeriodicalIF":8.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948278/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Personalized text-to-image generation aims to synthesize images tailored to individual user preferences. Existing methods primarily generate customized content from a few reference images; they often struggle to mine user preferences from historical records and thus fail to synthesize truly personalized content. Moreover, it is difficult to incorporate extracted user-preference features directly into the feature space of the generation model, since a considerable gap exists between the two. In this paper, we propose a novel multi-view personalized text-to-image generation method based on the diffusion model, named MVP-Diffusion, which learns instance- and user-level preferences from historical records and integrates them into the generation model. For instance-level preference modeling, we employ a chain-of-thought prompting strategy to deduce preference keywords and integrate them into the input prompt with the aid of a large language model. For user-level preference modeling, we construct a learnable embedding for each user, capturing more comprehensive preferences by analyzing that user's historical records. An adaptive user preference fusion module injects these preferences into the generation model via a set of learnable parameters. Experimental results demonstrate that the proposed method significantly enhances the personalization of the generated images compared with other personalized text-to-image generation methods.
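The abstract does not include implementation details, but the two preference views it describes can be illustrated with a short sketch. The snippet below is an assumption-based illustration, not the authors' actual architecture: the class name `UserPreferenceFusion`, the gated-residual injection, the chain-of-thought prompt template, and all dimensions are hypothetical choices made for exposition.

```python
import torch
import torch.nn as nn

# Hypothetical chain-of-thought prompt for instance-level preference
# modeling: an LLM deduces preference keywords from a user's history
# and folds them into the input prompt (the paper's exact prompt is
# not given here).
COT_PROMPT = (
    "Here are captions of images this user previously liked: {history}.\n"
    "Step 1: Identify recurring styles, subjects, and moods.\n"
    "Step 2: Summarize them as a few preference keywords.\n"
    "Step 3: Rewrite the prompt '{prompt}' to incorporate those keywords."
)


class UserPreferenceFusion(nn.Module):
    """Illustrative user-level preference fusion.

    Maintains one learnable embedding per user and injects it into the
    text-conditioning features of a diffusion model through a gated
    residual with learnable parameters.
    """

    def __init__(self, num_users: int, embed_dim: int, cond_dim: int):
        super().__init__()
        # One learnable embedding per user (user-level preference).
        self.user_embed = nn.Embedding(num_users, embed_dim)
        # Project the user embedding into the conditioning space.
        self.proj = nn.Linear(embed_dim, cond_dim)
        # Learnable gate controlling how strongly preferences are injected.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, cond: torch.Tensor, user_ids: torch.Tensor) -> torch.Tensor:
        # cond: (batch, seq_len, cond_dim) text-encoder features.
        pref = self.proj(self.user_embed(user_ids))          # (batch, cond_dim)
        pref = pref.unsqueeze(1).expand(-1, cond.size(1), -1)
        # Zero-initialized gate: training starts from the base model's
        # behavior and gradually opens the preference pathway.
        return cond + torch.tanh(self.gate) * pref


# Usage sketch with made-up sizes (77-token CLIP-style conditioning):
fusion = UserPreferenceFusion(num_users=1000, embed_dim=256, cond_dim=768)
cond = torch.randn(4, 77, 768)
user_ids = torch.tensor([0, 3, 7, 42])
cond_personalized = fusion(cond, user_ids)
```

The zero-initialized gate is one plausible way to bridge the feature-space gap the abstract mentions: at the start of training the fusion is a no-op, so preference injection is learned rather than imposed.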
About the Journal:
The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors, ensuring comprehensive coverage of multimedia research.