Dormant key: Unlocking universal adversarial control in text-to-image models.

Impact Factor 6.3 · CAS Tier 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jingqi Hu, Li Li, Hanzhou Wu, Huixin Luo, Xinpeng Zhang
{"title":"休眠键:解锁文本到图像模型中的通用对抗性控制。","authors":"Jingqi Hu, Li Li, Hanzhou Wu, Huixin Luo, Xinpeng Zhang","doi":"10.1016/j.neunet.2025.108065","DOIUrl":null,"url":null,"abstract":"<p><p>Text-to-Image (T2I) diffusion models have gained significant traction due to their remarkable image generation capabilities, raising growing concerns over the security risks associated with their use. Prior studies have shown that malicious users can subtly modify prompts to produce visually misleading or Not-Safe-For-Work (NSFW) content, even bypassing existing safety filters. Existing adversarial attacks are often optimized for specific prompts, limiting their generalizability, and their text-space perturbations are easily detectable by current defenses. To address these limitations, we propose a universal adversarial attack framework called dormant key. It appends a transferable suffix that can be appended as a \"plug-in\" to any text input to guide the generated image toward a specific target. To ensure robustness across diverse prompts, we introduce a novel hierarchical gradient aggregation strategy that stabilizes optimization over prompt batches. This enables efficient learning of universal perturbations in the text space, improving both attack transferability and imperceptibility. Experimental results show that our method effectively balances attack performance and stealth. In NSFW generation tasks, it bypasses major safety mechanisms, including keyword filtering, semantic analysis, and text classifiers, and achieves over 18 % improvement in success rate over baselines.</p>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"193 ","pages":"108065"},"PeriodicalIF":6.3000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dormant key: Unlocking universal adversarial control in text-to-image models.\",\"authors\":\"Jingqi Hu, Li Li, Hanzhou Wu, Huixin Luo, Xinpeng Zhang\",\"doi\":\"10.1016/j.neunet.2025.108065\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Text-to-Image (T2I) diffusion models have gained significant traction due to their remarkable image generation capabilities, raising growing concerns over the security risks associated with their use. Prior studies have shown that malicious users can subtly modify prompts to produce visually misleading or Not-Safe-For-Work (NSFW) content, even bypassing existing safety filters. Existing adversarial attacks are often optimized for specific prompts, limiting their generalizability, and their text-space perturbations are easily detectable by current defenses. To address these limitations, we propose a universal adversarial attack framework called dormant key. It appends a transferable suffix that can be appended as a \\\"plug-in\\\" to any text input to guide the generated image toward a specific target. To ensure robustness across diverse prompts, we introduce a novel hierarchical gradient aggregation strategy that stabilizes optimization over prompt batches. This enables efficient learning of universal perturbations in the text space, improving both attack transferability and imperceptibility. Experimental results show that our method effectively balances attack performance and stealth. 
In NSFW generation tasks, it bypasses major safety mechanisms, including keyword filtering, semantic analysis, and text classifiers, and achieves over 18 % improvement in success rate over baselines.</p>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"193 \",\"pages\":\"108065\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1016/j.neunet.2025.108065\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1016/j.neunet.2025.108065","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Text-to-Image (T2I) diffusion models have gained significant traction due to their remarkable image generation capabilities, raising growing concerns over the security risks associated with their use. Prior studies have shown that malicious users can subtly modify prompts to produce visually misleading or Not-Safe-For-Work (NSFW) content, even bypassing existing safety filters. Existing adversarial attacks are often optimized for specific prompts, limiting their generalizability, and their text-space perturbations are easily detectable by current defenses. To address these limitations, we propose a universal adversarial attack framework called dormant key. It learns a transferable suffix that can be appended as a "plug-in" to any text input to guide the generated image toward a specific target. To ensure robustness across diverse prompts, we introduce a novel hierarchical gradient aggregation strategy that stabilizes optimization over prompt batches. This enables efficient learning of universal perturbations in the text space, improving both attack transferability and imperceptibility. Experimental results show that our method effectively balances attack performance and stealth. In NSFW generation tasks, it bypasses major safety mechanisms, including keyword filtering, semantic analysis, and text classifiers, and improves the attack success rate by more than 18% over baselines.
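The abstract describes the method only at a high level: a single universal suffix is optimized so that, across many prompts, the text encoder's output shifts toward an attacker-chosen target, with per-batch gradients aggregated hierarchically to stabilize the update. The toy sketch below is a rough, non-authoritative illustration of that idea, not the paper's implementation: the `ToyTextEncoder`, the cosine-similarity loss, the continuous-embedding suffix, and the mean-based two-level aggregation are all illustrative assumptions, since the abstract does not specify these details.

```python
# Sketch: universal adversarial suffix optimization with two-level
# (per-batch, then cross-batch) gradient aggregation over prompt batches.
# All component names and choices here are hypothetical stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SUFFIX_LEN = 1000, 64, 8

class ToyTextEncoder(torch.nn.Module):
    """Stand-in for a frozen T2I text encoder (e.g. a CLIP-like model)."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, DIM)
        self.proj = torch.nn.Linear(DIM, DIM)

    def forward(self, token_embeds):  # (B, L, DIM) -> (B, DIM)
        return self.proj(token_embeds.mean(dim=1))

encoder = ToyTextEncoder().eval()
for p in encoder.parameters():        # encoder stays frozen; only the
    p.requires_grad_(False)           # suffix is optimized

# Universal suffix, optimized in continuous embedding space.
suffix = torch.randn(SUFFIX_LEN, DIM, requires_grad=True)
target = torch.randn(DIM)             # embedding of the attack target
opt = torch.optim.Adam([suffix], lr=1e-2)

# Fake prompt pool: random token ids standing in for benign prompts.
prompts = torch.randint(0, VOCAB, (64, 16))

for step in range(200):
    # Level 1: compute a gradient on each disjoint prompt mini-batch.
    batch_grads = []
    for batch in prompts.split(16):
        embeds = encoder.embed(batch)                       # (16, 16, DIM)
        full = torch.cat(
            [embeds, suffix.unsqueeze(0).expand(len(batch), -1, -1)], dim=1)
        loss = (1 - F.cosine_similarity(
            encoder(full), target.unsqueeze(0))).mean()
        g, = torch.autograd.grad(loss, suffix)
        batch_grads.append(g)
    # Level 2: aggregate across batches (a plain mean here; the paper's
    # scheme may differ) so no single batch dominates the universal update.
    suffix.grad = torch.stack(batch_grads).mean(dim=0)
    opt.step()
    opt.zero_grad()
```

In this sketch, averaging gradients over batches before each step is what makes the suffix "universal": it is pushed in a direction that helps on all sampled prompts rather than overfitting to one, which is the stabilizing role the abstract attributes to hierarchical gradient aggregation.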

Source journal: Neural Networks (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.90
Self-citation rate: 7.70%
Articles per year: 425
Review time: 67 days
Aims and scope: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.