针对视觉语言模型的多模态虚拟场景催眠越狱攻击

IF 7.6 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition Pub Date : 2025-09-04 DOI:10.1016/j.patcog.2025.112391

Xiayang Shi , Shangfeng Chen , Gang Zhang , Wei Wei , Yinlin Li , Zhaoxin Fan , Jingjing Liu

{"title":"针对视觉语言模型的多模态虚拟场景催眠越狱攻击","authors":"Xiayang Shi , Shangfeng Chen , Gang Zhang , Wei Wei , Yinlin Li , Zhaoxin Fan , Jingjing Liu","doi":"10.1016/j.patcog.2025.112391","DOIUrl":null,"url":null,"abstract":"<div><div>Due to the inherent vulnerabilities of large Vision-Language Models (VLMs), security governance has emerged as a critical concern, particularly given the risks posed by noisy and biased training data as well as adversarial attacks, including data poisoning and prompt injection. These perturbations can significantly degrade model performance and introduce multifaceted societal risks. To verify the safe robustness of VLMs and further inspire the design of defensive AI frameworks, we propose Virtual Scenario Hypnosis (VSH), a multimodal prompt injection jailbreak method that embeds malicious queries into prompts through a deceptive narrative framework. This approach strategically distracts the model while compromising its resistance to jailbreak attempts. Our methodology features two key innovations: 1) Targeted adversarial image prompts that transform textual content into visual layouts through optimized typographic designs, circumventing safety alignment mechanisms to elicit harmful responses; and 2) An information veil encrypted In-Context Learning (ICL) method for text prompts that systematically evades safety detection protocols. To streamline evaluation, we employ Large Language Models (LLMs) to facilitate an efficient assessment of jailbreak success rates, supported by a meticulously designed prompt template incorporating multi-dimensional scoring rules and evaluation metrics. Extensive experiments demonstrate the efficacy of VSH, achieving an overall success rate exceeding 82% on 500 harmful queries spanning multiple domains when tested against LLaVA-v1.5-13B and GPT-4o mini.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"172 ","pages":"Article 112391"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Jailbreak attack with multimodal virtual scenario hypnosis for vision-language models\",\"authors\":\"Xiayang Shi , Shangfeng Chen , Gang Zhang , Wei Wei , Yinlin Li , Zhaoxin Fan , Jingjing Liu\",\"doi\":\"10.1016/j.patcog.2025.112391\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Due to the inherent vulnerabilities of large Vision-Language Models (VLMs), security governance has emerged as a critical concern, particularly given the risks posed by noisy and biased training data as well as adversarial attacks, including data poisoning and prompt injection. These perturbations can significantly degrade model performance and introduce multifaceted societal risks. To verify the safe robustness of VLMs and further inspire the design of defensive AI frameworks, we propose Virtual Scenario Hypnosis (VSH), a multimodal prompt injection jailbreak method that embeds malicious queries into prompts through a deceptive narrative framework. This approach strategically distracts the model while compromising its resistance to jailbreak attempts. Our methodology features two key innovations: 1) Targeted adversarial image prompts that transform textual content into visual layouts through optimized typographic designs, circumventing safety alignment mechanisms to elicit harmful responses; and 2) An information veil encrypted In-Context Learning (ICL) method for text prompts that systematically evades safety detection protocols. To streamline evaluation, we employ Large Language Models (LLMs) to facilitate an efficient assessment of jailbreak success rates, supported by a meticulously designed prompt template incorporating multi-dimensional scoring rules and evaluation metrics. Extensive experiments demonstrate the efficacy of VSH, achieving an overall success rate exceeding 82% on 500 harmful queries spanning multiple domains when tested against LLaVA-v1.5-13B and GPT-4o mini.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"172 \",\"pages\":\"Article 112391\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325010520\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325010520","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

由于大型视觉语言模型（VLMs）固有的漏洞，安全治理已经成为一个关键问题，特别是考虑到噪声和有偏见的训练数据以及对抗性攻击（包括数据中毒和及时注入）所带来的风险。这些扰动会显著降低模型的性能，并引入多方面的社会风险。为了验证vlm的安全鲁棒性并进一步启发防御性AI框架的设计，我们提出了虚拟场景催眠（VSH），这是一种多模态提示注入越狱方法，通过欺骗性叙事框架将恶意查询嵌入提示中。这种方法在策略上分散了模型的注意力，同时损害了模型对越狱尝试的抵抗力。我们的方法有两个关键创新：1)有针对性的对抗性图像提示，通过优化的排版设计将文本内容转化为视觉布局，绕过安全对齐机制，引发有害反应；2)一种信息面纱加密的上下文学习（ICL）方法，用于系统地逃避安全检测协议的文本提示。为了简化评估，我们采用大型语言模型（llm）来促进对越狱成功率的有效评估，并辅以精心设计的包含多维评分规则和评估指标的提示模板。大量的实验证明了VSH的有效性，在针对LLaVA-v1.5-13B和gpt - 40 mini进行测试时，VSH在跨越多个域的500个有害查询中实现了超过82%的总体成功率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Jailbreak attack with multimodal virtual scenario hypnosis for vision-language models

查看原文本刊更多论文

Jailbreak attack with multimodal virtual scenario hypnosis for vision-language models

Due to the inherent vulnerabilities of large Vision-Language Models (VLMs), security governance has emerged as a critical concern, particularly given the risks posed by noisy and biased training data as well as adversarial attacks, including data poisoning and prompt injection. These perturbations can significantly degrade model performance and introduce multifaceted societal risks. To verify the safe robustness of VLMs and further inspire the design of defensive AI frameworks, we propose Virtual Scenario Hypnosis (VSH), a multimodal prompt injection jailbreak method that embeds malicious queries into prompts through a deceptive narrative framework. This approach strategically distracts the model while compromising its resistance to jailbreak attempts. Our methodology features two key innovations: 1) Targeted adversarial image prompts that transform textual content into visual layouts through optimized typographic designs, circumventing safety alignment mechanisms to elicit harmful responses; and 2) An information veil encrypted In-Context Learning (ICL) method for text prompts that systematically evades safety detection protocols. To streamline evaluation, we employ Large Language Models (LLMs) to facilitate an efficient assessment of jailbreak success rates, supported by a meticulously designed prompt template incorporating multi-dimensional scoring rules and evaluation metrics. Extensive experiments demonstrate the efficacy of VSH, achieving an overall success rate exceeding 82% on 500 harmful queries spanning multiple domains when tested against LLaVA-v1.5-13B and GPT-4o mini.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Pattern Recognition 工程技术-工程：电子与电气

CiteScore

14.40

自引率

16.20%

发文量

683

审稿时长

5.6 months

期刊介绍： The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.