John Burden, Manuel Cebrian, Jose Hernandez-Orallo
{"title":"评估大型语言模型风险的对话复杂性","authors":"John Burden, Manuel Cebrian, Jose Hernandez-Orallo","doi":"arxiv-2409.01247","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) present a dual-use dilemma: they enable\nbeneficial applications while harboring potential for harm, particularly\nthrough conversational interactions. Despite various safeguards, advanced LLMs\nremain vulnerable. A watershed case was Kevin Roose's notable conversation with\nBing, which elicited harmful outputs after extended interaction. This contrasts\nwith simpler early jailbreaks that produced similar content more easily,\nraising the question: How much conversational effort is needed to elicit\nharmful information from LLMs? We propose two measures: Conversational Length\n(CL), which quantifies the conversation length used to obtain a specific\nresponse, and Conversational Complexity (CC), defined as the Kolmogorov\ncomplexity of the user's instruction sequence leading to the response. To\naddress the incomputability of Kolmogorov complexity, we approximate CC using a\nreference LLM to estimate the compressibility of user instructions. Applying\nthis approach to a large red-teaming dataset, we perform a quantitative\nanalysis examining the statistical distribution of harmful and harmless\nconversational lengths and complexities. Our empirical findings suggest that\nthis distributional analysis and the minimisation of CC serve as valuable tools\nfor understanding AI safety, offering insights into the accessibility of\nharmful information. This work establishes a foundation for a new perspective\non LLM safety, centered around the algorithmic complexity of pathways to harm.","PeriodicalId":501082,"journal":{"name":"arXiv - MATH - Information Theory","volume":"287 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Conversational Complexity for Assessing Risk in Large Language Models\",\"authors\":\"John Burden, Manuel Cebrian, Jose Hernandez-Orallo\",\"doi\":\"arxiv-2409.01247\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) present a dual-use dilemma: they enable\\nbeneficial applications while harboring potential for harm, particularly\\nthrough conversational interactions. Despite various safeguards, advanced LLMs\\nremain vulnerable. A watershed case was Kevin Roose's notable conversation with\\nBing, which elicited harmful outputs after extended interaction. This contrasts\\nwith simpler early jailbreaks that produced similar content more easily,\\nraising the question: How much conversational effort is needed to elicit\\nharmful information from LLMs? We propose two measures: Conversational Length\\n(CL), which quantifies the conversation length used to obtain a specific\\nresponse, and Conversational Complexity (CC), defined as the Kolmogorov\\ncomplexity of the user's instruction sequence leading to the response. To\\naddress the incomputability of Kolmogorov complexity, we approximate CC using a\\nreference LLM to estimate the compressibility of user instructions. Applying\\nthis approach to a large red-teaming dataset, we perform a quantitative\\nanalysis examining the statistical distribution of harmful and harmless\\nconversational lengths and complexities. 
Our empirical findings suggest that\\nthis distributional analysis and the minimisation of CC serve as valuable tools\\nfor understanding AI safety, offering insights into the accessibility of\\nharmful information. This work establishes a foundation for a new perspective\\non LLM safety, centered around the algorithmic complexity of pathways to harm.\",\"PeriodicalId\":501082,\"journal\":{\"name\":\"arXiv - MATH - Information Theory\",\"volume\":\"287 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - MATH - Information Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.01247\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - MATH - Information Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Conversational Complexity for Assessing Risk in Large Language Models
Large Language Models (LLMs) present a dual-use dilemma: they enable
beneficial applications while harboring potential for harm, particularly
through conversational interactions. Despite various safeguards, advanced LLMs
remain vulnerable. A watershed case was Kevin Roose's notable conversation with
Bing, which elicited harmful outputs after extended interaction. This contrasts
with simpler early jailbreaks that produced similar content more easily,
raising the question: How much conversational effort is needed to elicit
harmful information from LLMs? We propose two measures: Conversational Length
(CL), which quantifies the conversation length used to obtain a specific
response, and Conversational Complexity (CC), defined as the Kolmogorov
complexity of the user's instruction sequence leading to the response. To
address the incomputability of Kolmogorov complexity, we approximate CC using a
reference LLM to estimate the compressibility of user instructions. Applying
this approach to a large red-teaming dataset, we perform a quantitative
analysis examining the statistical distribution of harmful and harmless
conversational lengths and complexities. Our empirical findings suggest that
this distributional analysis and the minimisation of CC serve as valuable tools
for understanding AI safety, offering insights into the accessibility of
harmful information. This work establishes a foundation for a new perspective
on LLM safety, centered around the algorithmic complexity of pathways to harm.
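
As a rough illustration of the CC approximation described above (a sketch, not the authors' exact implementation), the snippet below estimates the compressibility of a user's instruction sequence as its negative log2-likelihood under a reference language model, i.e., an approximate code length in bits. The choice of GPT-2 as the reference model and the per-turn summation are assumptions made here for concreteness.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def estimate_bits(text: str, model, tokenizer) -> float:
    """Approximate code length of `text` in bits: its negative
    log2-likelihood under the reference language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean cross-entropy
        # (in nats) over the predicted tokens.
        loss = model(ids, labels=ids).loss
    n_predicted = max(ids.shape[1] - 1, 1)  # first token is not predicted
    return loss.item() * n_predicted / math.log(2)  # nats -> bits

# Hypothetical usage: GPT-2 stands in for the reference LLM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

user_turns = ["Tell me about chemistry.", "Now focus on energetic reactions."]
cl = len(user_turns)                                      # Conversational Length
cc = sum(estimate_bits(t, lm, tok) for t in user_turns)   # CC proxy, in bits
```

Under this reading, a conversation whose instructions are highly predictable to the reference model compresses well (low CC), while convoluted jailbreak prompts incur a higher bit cost.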