Dormant key: Unlocking universal adversarial control in text-to-image models.

Impact Factor 6.3 · CAS Tier 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jingqi Hu, Li Li, Hanzhou Wu, Huixin Luo, Xinpeng Zhang
{"title":"休眠键:解锁文本到图像模型中的通用对抗性控制。","authors":"Jingqi Hu, Li Li, Hanzhou Wu, Huixin Luo, Xinpeng Zhang","doi":"10.1016/j.neunet.2025.108065","DOIUrl":null,"url":null,"abstract":"<p><p>Text-to-Image (T2I) diffusion models have gained significant traction due to their remarkable image generation capabilities, raising growing concerns over the security risks associated with their use. Prior studies have shown that malicious users can subtly modify prompts to produce visually misleading or Not-Safe-For-Work (NSFW) content, even bypassing existing safety filters. Existing adversarial attacks are often optimized for specific prompts, limiting their generalizability, and their text-space perturbations are easily detectable by current defenses. To address these limitations, we propose a universal adversarial attack framework called dormant key. It appends a transferable suffix that can be appended as a \"plug-in\" to any text input to guide the generated image toward a specific target. To ensure robustness across diverse prompts, we introduce a novel hierarchical gradient aggregation strategy that stabilizes optimization over prompt batches. This enables efficient learning of universal perturbations in the text space, improving both attack transferability and imperceptibility. Experimental results show that our method effectively balances attack performance and stealth. In NSFW generation tasks, it bypasses major safety mechanisms, including keyword filtering, semantic analysis, and text classifiers, and achieves over 18 % improvement in success rate over baselines.</p>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"193 ","pages":"108065"},"PeriodicalIF":6.3000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dormant key: Unlocking universal adversarial control in text-to-image models.\",\"authors\":\"Jingqi Hu, Li Li, Hanzhou Wu, Huixin Luo, Xinpeng Zhang\",\"doi\":\"10.1016/j.neunet.2025.108065\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Text-to-Image (T2I) diffusion models have gained significant traction due to their remarkable image generation capabilities, raising growing concerns over the security risks associated with their use. Prior studies have shown that malicious users can subtly modify prompts to produce visually misleading or Not-Safe-For-Work (NSFW) content, even bypassing existing safety filters. Existing adversarial attacks are often optimized for specific prompts, limiting their generalizability, and their text-space perturbations are easily detectable by current defenses. To address these limitations, we propose a universal adversarial attack framework called dormant key. It appends a transferable suffix that can be appended as a \\\"plug-in\\\" to any text input to guide the generated image toward a specific target. To ensure robustness across diverse prompts, we introduce a novel hierarchical gradient aggregation strategy that stabilizes optimization over prompt batches. This enables efficient learning of universal perturbations in the text space, improving both attack transferability and imperceptibility. Experimental results show that our method effectively balances attack performance and stealth. 
In NSFW generation tasks, it bypasses major safety mechanisms, including keyword filtering, semantic analysis, and text classifiers, and achieves over 18 % improvement in success rate over baselines.</p>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"193 \",\"pages\":\"108065\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1016/j.neunet.2025.108065\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1016/j.neunet.2025.108065","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Text-to-Image (T2I) diffusion models have gained significant traction due to their remarkable image generation capabilities, raising growing concerns over the security risks associated with their use. Prior studies have shown that malicious users can subtly modify prompts to produce visually misleading or Not-Safe-For-Work (NSFW) content, even bypassing existing safety filters. Existing adversarial attacks are often optimized for specific prompts, limiting their generalizability, and their text-space perturbations are easily detectable by current defenses. To address these limitations, we propose a universal adversarial attack framework called dormant key. It learns a transferable suffix that can be appended as a "plug-in" to any text input to guide the generated image toward a specific target. To ensure robustness across diverse prompts, we introduce a novel hierarchical gradient aggregation strategy that stabilizes optimization over prompt batches. This enables efficient learning of universal perturbations in the text space, improving both attack transferability and imperceptibility. Experimental results show that our method effectively balances attack performance and stealth. In NSFW generation tasks, it bypasses major safety mechanisms, including keyword filtering, semantic analysis, and text classifiers, and improves the attack success rate by more than 18% over baselines.
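The abstract describes the method only at a high level: a single universal suffix is optimized so that, across many prompts, the text encoder's output shifts toward an attacker-chosen target, with per-batch gradients aggregated hierarchically to stabilize the update. The toy sketch below is a rough, non-authoritative illustration of that idea, not the paper's implementation: the `ToyTextEncoder`, the cosine-similarity loss, the continuous-embedding suffix, and the mean-based two-level aggregation are all illustrative assumptions, since the abstract does not specify these details.

```python
# Sketch: universal adversarial suffix optimization with two-level
# (per-batch, then cross-batch) gradient aggregation over prompt batches.
# All component names and choices here are hypothetical stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SUFFIX_LEN = 1000, 64, 8

class ToyTextEncoder(torch.nn.Module):
    """Stand-in for a frozen T2I text encoder (e.g. a CLIP-like model)."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, DIM)
        self.proj = torch.nn.Linear(DIM, DIM)

    def forward(self, token_embeds):  # (B, L, DIM) -> (B, DIM)
        return self.proj(token_embeds.mean(dim=1))

encoder = ToyTextEncoder().eval()
for p in encoder.parameters():        # encoder stays frozen; only the
    p.requires_grad_(False)           # suffix is optimized

# Universal suffix, optimized in continuous embedding space.
suffix = torch.randn(SUFFIX_LEN, DIM, requires_grad=True)
target = torch.randn(DIM)             # embedding of the attack target
opt = torch.optim.Adam([suffix], lr=1e-2)

# Fake prompt pool: random token ids standing in for benign prompts.
prompts = torch.randint(0, VOCAB, (64, 16))

for step in range(200):
    # Level 1: compute a gradient on each disjoint prompt mini-batch.
    batch_grads = []
    for batch in prompts.split(16):
        embeds = encoder.embed(batch)                       # (16, 16, DIM)
        full = torch.cat(
            [embeds, suffix.unsqueeze(0).expand(len(batch), -1, -1)], dim=1)
        loss = (1 - F.cosine_similarity(
            encoder(full), target.unsqueeze(0))).mean()
        g, = torch.autograd.grad(loss, suffix)
        batch_grads.append(g)
    # Level 2: aggregate across batches (a plain mean here; the paper's
    # scheme may differ) so no single batch dominates the universal update.
    suffix.grad = torch.stack(batch_grads).mean(dim=0)
    opt.step()
    opt.zero_grad()
```

In this sketch, averaging gradients over batches before each step is what makes the suffix "universal": it is pushed in a direction that helps on all sampled prompts rather than overfitting to one, which is the stabilizing role the abstract attributes to hierarchical gradient aggregation.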

Source journal: Neural Networks (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.90
Self-citation rate: 7.70%
Articles per year: 425
Review time: 67 days
Aims and scope: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.