Aligning Text-to-Image Diffusion Models with Constrained Reinforcement Learning.

Impact Factor: 20.8 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Ziyi Zhang, Sen Zhang, Li Shen, Yibing Zhan, Yong Luo, Han Hu, Bo Du, Yonggang Wen, Dacheng Tao
{"title":"Aligning Text-to-Image Diffusion Models with Constrained Reinforcement Learning.","authors":"Ziyi Zhang,Sen Zhang,Li Shen,Yibing Zhan,Yong Luo,Han Hu,Bo Du,Yonggang Wen,Dacheng Tao","doi":"10.1109/tpami.2025.3590730","DOIUrl":null,"url":null,"abstract":"Reward finetuning has emerged as a powerful technique for aligning diffusion models with specific downstream objectives or user preferences. However, current approaches suffer from a persistent challenge of reward overoptimization, where models exploit imperfect reward feedback at the expense of overall performance. In this work, we identify three key contributors to overoptimization: (1) a granularity mismatch between the multi-step diffusion process and sparse rewards; (2) a loss of plasticity that limits the model's ability to adapt and generalize; and (3) an overly narrow focus on a single reward objective that neglects complementary performance criteria. Accordingly, we introduce Constrained Diffusion Policy Optimization (CDPO), a novel reinforcement learning framework that addresses reward overoptimization from multiple angles. Firstly, CDPO tackles the granularity mismatch through a temporal policy optimization strategy that delivers step-specific rewards throughout the entire diffusion trajectory, thereby reducing the risk of overfitting to sparse final-step rewards. Then we incorporate a neuron reset strategy that selectively resets overactive neurons in the model, preventing overoptimization induced by plasticity loss. Finally, to avoid overfitting to a narrow reward objective, we integrate constrained reinforcement learning with auxiliary reward objectives serving as explicit constraints, ensuring a balanced optimization across diverse performance metrics.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"73 4 1","pages":""},"PeriodicalIF":20.8000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3590730","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Reward finetuning has emerged as a powerful technique for aligning diffusion models with specific downstream objectives or user preferences. However, current approaches suffer from a persistent challenge of reward overoptimization, where models exploit imperfect reward feedback at the expense of overall performance. In this work, we identify three key contributors to overoptimization: (1) a granularity mismatch between the multi-step diffusion process and sparse rewards; (2) a loss of plasticity that limits the model's ability to adapt and generalize; and (3) an overly narrow focus on a single reward objective that neglects complementary performance criteria. Accordingly, we introduce Constrained Diffusion Policy Optimization (CDPO), a novel reinforcement learning framework that addresses reward overoptimization from multiple angles. Firstly, CDPO tackles the granularity mismatch through a temporal policy optimization strategy that delivers step-specific rewards throughout the entire diffusion trajectory, thereby reducing the risk of overfitting to sparse final-step rewards. Then we incorporate a neuron reset strategy that selectively resets overactive neurons in the model, preventing overoptimization induced by plasticity loss. Finally, to avoid overfitting to a narrow reward objective, we integrate constrained reinforcement learning with auxiliary reward objectives serving as explicit constraints, ensuring a balanced optimization across diverse performance metrics.
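
To make the constrained-optimization idea in the abstract concrete, here is a minimal, hypothetical Python sketch (not the authors' implementation; the function names, reward values, and thresholds below are illustrative assumptions) of how auxiliary reward objectives could act as explicit constraints via Lagrangian multipliers, with dual ascent increasing a multiplier while its constraint is violated.

```python
# Illustrative sketch only: primary reward maximized subject to auxiliary
# rewards staying above thresholds, handled with Lagrangian multipliers.
import numpy as np

def lagrangian_objective(primary_reward, aux_rewards, thresholds, lambdas):
    """Primary reward minus penalties for violated auxiliary constraints."""
    violations = thresholds - aux_rewards   # > 0 when a constraint is violated
    return primary_reward - np.sum(lambdas * violations)

def update_multipliers(lambdas, aux_rewards, thresholds, lr=0.01):
    """Dual ascent: grow a multiplier while its constraint is violated,
    shrink it (clipped at zero) once the constraint is satisfied."""
    violations = thresholds - aux_rewards
    return np.maximum(0.0, lambdas + lr * violations)

# Toy usage: one primary reward, two auxiliary rewards with target thresholds
# (hypothetical values, e.g. an aesthetic score and a text-alignment score).
lambdas = np.zeros(2)
for step in range(3):
    primary = 0.8
    aux = np.array([0.55, 0.70])
    thr = np.array([0.60, 0.65])
    obj = lagrangian_objective(primary, aux, thr, lambdas)
    lambdas = update_multipliers(lambdas, aux, thr)
    print(f"step {step}: objective={obj:.3f}, lambdas={lambdas}")
```
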
Source journal metrics
CiteScore: 28.40
Self-citation rate: 3.00%
Annual publications: 885
Average review time: 8.5 months
Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.