{"title":"Jailbreaking Large Language Models with Symbolic Mathematics","authors":"Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad","doi":"arxiv-2409.11445","DOIUrl":null,"url":null,"abstract":"Recent advancements in AI safety have led to increased efforts in training\nand red-teaming large language models (LLMs) to mitigate unsafe content\ngeneration. However, these safety mechanisms may not be comprehensive, leaving\npotential vulnerabilities unexplored. This paper introduces MathPrompt, a novel\njailbreaking technique that exploits LLMs' advanced capabilities in symbolic\nmathematics to bypass their safety mechanisms. By encoding harmful natural\nlanguage prompts into mathematical problems, we demonstrate a critical\nvulnerability in current AI safety measures. Our experiments across 13\nstate-of-the-art LLMs reveal an average attack success rate of 73.6\\%,\nhighlighting the inability of existing safety training mechanisms to generalize\nto mathematically encoded inputs. Analysis of embedding vectors shows a\nsubstantial semantic shift between original and encoded prompts, helping\nexplain the attack's success. This work emphasizes the importance of a holistic\napproach to AI safety, calling for expanded red-teaming efforts to develop\nrobust safeguards across all potential input types and their associated risks.","PeriodicalId":501332,"journal":{"name":"arXiv - CS - Cryptography and Security","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Cryptography and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11445","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
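The abstract attributes the attack's success to a measurable semantic shift between original and math-encoded prompts in embedding space. The sketch below is a minimal, hedged illustration of that kind of analysis, not the paper's actual pipeline: the embedding model (all-MiniLM-L6-v2 from sentence-transformers) and the benign example prompt pair are assumptions chosen purely for demonstration; the abstract does not specify which embedding model or prompt set the authors used.

```python
# Minimal sketch of the embedding-shift analysis described in the abstract:
# measure how far a math-encoded prompt drifts from its natural-language
# original in embedding space.
# Assumptions (not from the paper): the embedding model and the benign
# example pair below are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

original = "Describe, step by step, how to bake a loaf of bread."
encoded = (
    "Let S be the set of all finite sequences of kitchen operations. "
    "Define a predicate B(s) that holds iff executing s yields a loaf of bread. "
    "Prove that there exists an s in S with B(s), and exhibit such an s explicitly."
)

# encode() returns one embedding vector per input string
emb_original, emb_encoded = model.encode([original, encoded])

# Cosine similarity near 1 means the two prompts sit close together in
# embedding space; a markedly lower value reflects the kind of semantic
# shift the abstract points to.
similarity = util.cos_sim(emb_original, emb_encoded).item()
print(f"cosine similarity (original vs. math-encoded): {similarity:.3f}")
```

A low similarity between prompts that are semantically equivalent to a human reader is consistent with the paper's explanation: safety training anchored to natural-language surface forms does not transfer to mathematically encoded inputs.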