Self-Guidance: Boosting Flow and Diffusion Generation on Their Own.

IF 18.6 · Zone 1 Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence Pub Date : 2025-09-18 DOI:10.1109/tpami.2025.3611831

Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, Guo-Jun Qi


Citations: 0

Abstract

Proper guidance strategies are essential for achieving high-quality generation without retraining diffusion and flow-based text-to-image models. Existing guidance methods require either guidance-specific training or strong inductive biases in the diffusion network, which can limit their capability and scope of application. Motivated by the observation that artifact outliers can be detected by a significant drop in density when moving from a noisier to a cleaner noise level, we propose Self-Guidance (SG), which significantly improves the quality of generated images by suppressing the generation of low-quality samples. Unlike existing guidance methods, SG relies only on the sampling score function of the original diffusion or flow model at different noise levels and needs no tricky or expensive guidance-specific training. SG can therefore be used in a plug-and-play manner with any diffusion or flow model. We also introduce an efficient variant, SG-prev, which reuses the output of the immediately preceding diffusion step to avoid additional forward passes through the diffusion network. We conduct extensive experiments on text-to-image and text-to-video generation with different architectures, including UNet and transformer models. With open-source diffusion models such as Stable Diffusion 3.5 and FLUX, SG outperforms existing algorithms on multiple metrics, including FID and Human Preference Score. SG-prev also achieves strong results relative to both the baseline and SG while being 50 percent more efficient. Moreover, we find that SG and SG-prev both have a surprisingly positive effect on generating physiologically correct human body structures such as hands, faces, and arms, showing that they can eliminate human body artifacts with minimal effort. We have released our code at https://github.com/maple-research-lab/Self-Guidance.
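The abstract does not state the SG update rule, so the following is only an illustrative sketch of the general shape it describes: one variant compares the model's output at the current noise level with its output at a noisier level (an extra forward pass), while the other reuses the previous step's output instead. Everything here is an assumption made for illustration, including the toy model, the extrapolation form in `combine`, the `delta` offset, and the Euler-style step; it is not the authors' algorithm or their released code.

```python
class CountingModel:
    """Toy stand-in for a diffusion/flow network (hypothetical); counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, sigma):
        self.calls += 1
        # Made-up "denoised" prediction: shrink x more strongly at higher noise.
        return x / (1.0 + sigma ** 2)


def combine(pred, ref, w):
    # Assumed guidance form: extrapolate away from the reference prediction.
    return pred + w * (pred - ref)


def sample(model, x, sigmas, w, delta=0.1, reuse_prev=False):
    """Run a deterministic sampler over decreasing noise levels `sigmas`."""
    prev = None
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        pred = model(x, sigma)
        if reuse_prev:
            # SG-prev-like: reuse the previous step's output, no extra pass.
            ref = prev if prev is not None else pred
        else:
            # SG-like: one extra forward pass at a noisier level.
            ref = model(x, sigma + delta)
        guided = combine(pred, ref, w)
        # Euler-style step toward the guided prediction.
        x = guided + (sigma_next / sigma) * (x - guided)
        prev = pred
    return x


sigmas = [i / 10 for i in range(10, -1, -1)]  # 1.0, 0.9, ..., 0.0
sg_model, prev_model = CountingModel(), CountingModel()
x_sg = sample(sg_model, 1.0, sigmas, w=0.5)
x_prev = sample(prev_model, 1.0, sigmas, w=0.5, reuse_prev=True)
print(sg_model.calls, prev_model.calls)
```

In this toy setup the reuse variant needs one forward pass per step instead of two, which is the kind of saving the abstract attributes to SG-prev; the actual cost accounting in the paper may differ.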


Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

28.40

Self-citation rate

3.00%

Articles published

885

Review time

8.5 months

Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.