Aligning Text-to-Image Diffusion Models with Constrained Reinforcement Learning.

Impact Factor: 20.8 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Ziyi Zhang, Sen Zhang, Li Shen, Yibing Zhan, Yong Luo, Han Hu, Bo Du, Yonggang Wen, Dacheng Tao
{"title":"Aligning Text-to-Image Diffusion Models with Constrained Reinforcement Learning.","authors":"Ziyi Zhang,Sen Zhang,Li Shen,Yibing Zhan,Yong Luo,Han Hu,Bo Du,Yonggang Wen,Dacheng Tao","doi":"10.1109/tpami.2025.3590730","DOIUrl":null,"url":null,"abstract":"Reward finetuning has emerged as a powerful technique for aligning diffusion models with specific downstream objectives or user preferences. However, current approaches suffer from a persistent challenge of reward overoptimization, where models exploit imperfect reward feedback at the expense of overall performance. In this work, we identify three key contributors to overoptimization: (1) a granularity mismatch between the multi-step diffusion process and sparse rewards; (2) a loss of plasticity that limits the model's ability to adapt and generalize; and (3) an overly narrow focus on a single reward objective that neglects complementary performance criteria. Accordingly, we introduce Constrained Diffusion Policy Optimization (CDPO), a novel reinforcement learning framework that addresses reward overoptimization from multiple angles. Firstly, CDPO tackles the granularity mismatch through a temporal policy optimization strategy that delivers step-specific rewards throughout the entire diffusion trajectory, thereby reducing the risk of overfitting to sparse final-step rewards. Then we incorporate a neuron reset strategy that selectively resets overactive neurons in the model, preventing overoptimization induced by plasticity loss. Finally, to avoid overfitting to a narrow reward objective, we integrate constrained reinforcement learning with auxiliary reward objectives serving as explicit constraints, ensuring a balanced optimization across diverse performance metrics.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"73 4 1","pages":""},"PeriodicalIF":20.8000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3590730","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Reward finetuning has emerged as a powerful technique for aligning diffusion models with specific downstream objectives or user preferences. However, current approaches suffer from a persistent challenge of reward overoptimization, where models exploit imperfect reward feedback at the expense of overall performance. In this work, we identify three key contributors to overoptimization: (1) a granularity mismatch between the multi-step diffusion process and sparse rewards; (2) a loss of plasticity that limits the model's ability to adapt and generalize; and (3) an overly narrow focus on a single reward objective that neglects complementary performance criteria. Accordingly, we introduce Constrained Diffusion Policy Optimization (CDPO), a novel reinforcement learning framework that addresses reward overoptimization from multiple angles. Firstly, CDPO tackles the granularity mismatch through a temporal policy optimization strategy that delivers step-specific rewards throughout the entire diffusion trajectory, thereby reducing the risk of overfitting to sparse final-step rewards. Then we incorporate a neuron reset strategy that selectively resets overactive neurons in the model, preventing overoptimization induced by plasticity loss. Finally, to avoid overfitting to a narrow reward objective, we integrate constrained reinforcement learning with auxiliary reward objectives serving as explicit constraints, ensuring a balanced optimization across diverse performance metrics.
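
To make the constrained-optimization idea in the abstract concrete, here is a minimal, hypothetical Python sketch (not the authors' implementation; the function names, reward values, and thresholds below are illustrative assumptions) of how auxiliary reward objectives could act as explicit constraints via Lagrangian multipliers, with dual ascent increasing a multiplier while its constraint is violated.

```python
# Illustrative sketch only: primary reward maximized subject to auxiliary
# rewards staying above thresholds, handled with Lagrangian multipliers.
import numpy as np

def lagrangian_objective(primary_reward, aux_rewards, thresholds, lambdas):
    """Primary reward minus penalties for violated auxiliary constraints."""
    violations = thresholds - aux_rewards   # > 0 when a constraint is violated
    return primary_reward - np.sum(lambdas * violations)

def update_multipliers(lambdas, aux_rewards, thresholds, lr=0.01):
    """Dual ascent: grow a multiplier while its constraint is violated,
    shrink it (clipped at zero) once the constraint is satisfied."""
    violations = thresholds - aux_rewards
    return np.maximum(0.0, lambdas + lr * violations)

# Toy usage: one primary reward, two auxiliary rewards with target thresholds
# (hypothetical values, e.g. an aesthetic score and a text-alignment score).
lambdas = np.zeros(2)
for step in range(3):
    primary = 0.8
    aux = np.array([0.55, 0.70])
    thr = np.array([0.60, 0.65])
    obj = lagrangian_objective(primary, aux, thr, lambdas)
    lambdas = update_multipliers(lambdas, aux, thr)
    print(f"step {step}: objective={obj:.3f}, lambdas={lambdas}")
```
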
Source journal metrics
CiteScore: 28.40
Self-citation rate: 3.00%
Annual publications: 885
Average review time: 8.5 months
Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.