Self-Guidance: Boosting Flow and Diffusion Generation on Their Own.

IF 18.6 1区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Tiancheng Li,Weijian Luo,Zhiyang Chen,Liyuan Ma,Guo-Jun Qi
{"title":"Self-Guidance: Boosting Flow and Diffusion Generation on Their Own.","authors":"Tiancheng Li,Weijian Luo,Zhiyang Chen,Liyuan Ma,Guo-Jun Qi","doi":"10.1109/tpami.2025.3611831","DOIUrl":null,"url":null,"abstract":"Proper guidance strategies are essential to achieve high-quality generation results without retraining diffusion and flow-based text-to-image models. Existing guidance either requires specific training or strong inductive biases of diffusion model networks, which potentially limits their ability and application scope. Motivated by the observation that artifact outliers can be detected by a significant decline in the density from a noisier to a cleaner noise level, we propose Self-Guidance (SG), which can significantly improve the quality of the generated image by suppressing the generation of low-quality samples. The biggest difference from existing guidance is that SG only relies on the sampling score function of the original diffusion or flow model at different noise levels, with no need for any tricky and expensive guidance-specific training. This makes SG highly flexible to be used in a plug-and-play manner by any diffusion or flow models. We also introduce an efficient variant of SG, named SG-prev, which reuses the output from the immediately previous diffusion step to avoid additional forward passes of the diffusion network. We conduct extensive experiments on text-to-image and text-to-video generation with different architectures, including UNet and transformer models. With open-sourced diffusion models such as Stable Diffusion 3.5 and FLUX, SG exceeds existing algorithms on multiple metrics, including both FID and Human Preference Score. SG-prev also achieves strong results over both the baseline and the SG, with 50 percent more efficiency. Moreover, we find that SG and SG-prev both have a surprisingly positive effect on the generation of physiologically correct human body structures such as hands, faces, and arms, showing their ability to eliminate human body artifacts with minimal efforts. We have released our code at https://github.com/maple-research-lab/Self-Guidance.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"38 1","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3611831","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Proper guidance strategies are essential to achieve high-quality generation results without retraining diffusion and flow-based text-to-image models. Existing guidance either requires specific training or strong inductive biases of diffusion model networks, which potentially limits their ability and application scope. Motivated by the observation that artifact outliers can be detected by a significant decline in the density from a noisier to a cleaner noise level, we propose Self-Guidance (SG), which can significantly improve the quality of the generated image by suppressing the generation of low-quality samples. The biggest difference from existing guidance is that SG only relies on the sampling score function of the original diffusion or flow model at different noise levels, with no need for any tricky and expensive guidance-specific training. This makes SG highly flexible to be used in a plug-and-play manner by any diffusion or flow models. We also introduce an efficient variant of SG, named SG-prev, which reuses the output from the immediately previous diffusion step to avoid additional forward passes of the diffusion network. We conduct extensive experiments on text-to-image and text-to-video generation with different architectures, including UNet and transformer models. With open-sourced diffusion models such as Stable Diffusion 3.5 and FLUX, SG exceeds existing algorithms on multiple metrics, including both FID and Human Preference Score. SG-prev also achieves strong results over both the baseline and the SG, with 50 percent more efficiency. Moreover, we find that SG and SG-prev both have a surprisingly positive effect on the generation of physiologically correct human body structures such as hands, faces, and arms, showing their ability to eliminate human body artifacts with minimal efforts. We have released our code at https://github.com/maple-research-lab/Self-Guidance.
自我引导:促进流动和扩散产生自己。
适当的引导策略对于在不重新训练扩散和基于流的文本到图像模型的情况下获得高质量的生成结果至关重要。现有的导引要么需要特定的训练,要么需要扩散模型网络的强归纳偏差,这可能限制了它们的能力和应用范围。由于观察到从噪声水平到清洁噪声水平的密度显著下降可以检测到伪异常值,我们提出了自引导(SG),它可以通过抑制低质量样本的生成来显着提高生成图像的质量。与现有制导最大的不同是,SG只依赖于原始扩散或流动模型在不同噪声水平下的采样分数函数,不需要任何棘手和昂贵的制导特定训练。这使得SG高度灵活,可以通过任何扩散或流动模型以即插即用的方式使用。我们还引入了SG的一个有效变体,称为SG-prev,它重用了前一个扩散步骤的输出,以避免扩散网络的额外前向传递。我们使用不同的架构(包括UNet和transformer模型)对文本到图像和文本到视频的生成进行了广泛的实验。使用开源的扩散模型,如Stable diffusion 3.5和FLUX, SG在多个指标上超过了现有的算法,包括FID和Human Preference Score。SG-prev在基线和SG上都取得了良好的效果,效率提高了50%。此外,我们发现SG和SG-prev都对产生生理上正确的人体结构(如手、脸和手臂)有惊人的积极影响,表明它们能够以最小的努力消除人体伪影。我们已经在https://github.com/maple-research-lab/Self-Guidance上发布了我们的代码。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
28.40
自引率
3.00%
发文量
885
审稿时长
8.5 months
期刊介绍: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信