Jailbreaking Large Language Models with Symbolic Mathematics

Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad
{"title":"Jailbreaking Large Language Models with Symbolic Mathematics","authors":"Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad","doi":"arxiv-2409.11445","DOIUrl":null,"url":null,"abstract":"Recent advancements in AI safety have led to increased efforts in training\nand red-teaming large language models (LLMs) to mitigate unsafe content\ngeneration. However, these safety mechanisms may not be comprehensive, leaving\npotential vulnerabilities unexplored. This paper introduces MathPrompt, a novel\njailbreaking technique that exploits LLMs' advanced capabilities in symbolic\nmathematics to bypass their safety mechanisms. By encoding harmful natural\nlanguage prompts into mathematical problems, we demonstrate a critical\nvulnerability in current AI safety measures. Our experiments across 13\nstate-of-the-art LLMs reveal an average attack success rate of 73.6\\%,\nhighlighting the inability of existing safety training mechanisms to generalize\nto mathematically encoded inputs. Analysis of embedding vectors shows a\nsubstantial semantic shift between original and encoded prompts, helping\nexplain the attack's success. This work emphasizes the importance of a holistic\napproach to AI safety, calling for expanded red-teaming efforts to develop\nrobust safeguards across all potential input types and their associated risks.","PeriodicalId":501332,"journal":{"name":"arXiv - CS - Cryptography and Security","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Cryptography and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11445","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
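The embedding-shift claim can be illustrated with a short, self-contained sketch. The snippet below is not the paper's pipeline: it assumes an off-the-shelf sentence-embedding model (all-MiniLM-L6-v2, chosen here for illustration) and a deliberately benign prompt pair, and simply measures the cosine similarity between a natural-language instruction and a set-theoretic restatement of it. A low similarity score is the kind of semantic shift the abstract describes.

```python
# Minimal sketch of the embedding-shift analysis described in the abstract.
# The model choice and the (benign) prompt pair are illustrative assumptions,
# not the paper's actual experimental setup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

original = "Describe the steps to assemble a model rocket."
encoded = (
    "Let S be the set of all ordered procedures P = (p_1, ..., p_n) such that "
    "executing P yields an assembled model rocket. Characterize an element of S."
)

# Embed both prompts; encode() returns one vector per input string.
e_orig, e_enc = model.encode([original, encoded])

# Cosine similarity: values well below 1.0 indicate that the mathematical
# encoding has moved the prompt away from its natural-language embedding.
cos_sim = np.dot(e_orig, e_enc) / (np.linalg.norm(e_orig) * np.linalg.norm(e_enc))
print(f"cosine similarity: {cos_sim:.3f}")
```

A large drop in similarity between the two embeddings suggests why safety filters calibrated on natural-language inputs may fail to transfer to mathematically encoded ones.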