生成式人工智能的空前发展：可信和恶意大型语言模型（LLMs）的实证分析

IF 1.9 4区工程技术 Q3 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Technology and Society Magazine Pub Date : 2025-07-23 DOI:10.1109/MTS.2025.3582667

Aditya K. Sood;Sherali Zeadally

{"title":"生成式人工智能的空前发展：可信和恶意大型语言模型（LLMs）的实证分析","authors":"Aditya K. Sood;Sherali Zeadally","doi":"10.1109/MTS.2025.3582667","DOIUrl":null,"url":null,"abstract":"Trusted large language models (LLMs) inherit ethical guidelines to prevent generating harmful content, whereas malicious LLMs are engineered to enable the generation of unethical and toxic responses. Both trusted and malicious LLMs use guardrails in differential contexts per the requirements of the developers and attackers, respectively. We explore the multifaceted world of guardrails implementation in LLMs by conducting an empirical analysis to assess the effectiveness of guardrails using prompts. Our results revealed that guardrails deployed in the trusted LLMs could be bypassed using prompt manipulation techniques such as “pretend” and “persist” to generate harmful content. In addition, we also discovered that malicious LLMs still deploy weak guardrails to evade detection by generating human-like content. This empirical analysis provides insights into the design of the malicious and trusted LLMs. We also propose recommendations to defend against prompt manipulation and guardrails bypass while designing LLMs.","PeriodicalId":55016,"journal":{"name":"IEEE Technology and Society Magazine","volume":"44 3","pages":"98-108"},"PeriodicalIF":1.9000,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The Unprecedented Surge in Generative AI: Empirical Analysis of Trusted and Malicious Large Language Models (LLMs)\",\"authors\":\"Aditya K. Sood;Sherali Zeadally\",\"doi\":\"10.1109/MTS.2025.3582667\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Trusted large language models (LLMs) inherit ethical guidelines to prevent generating harmful content, whereas malicious LLMs are engineered to enable the generation of unethical and toxic responses. Both trusted and malicious LLMs use guardrails in differential contexts per the requirements of the developers and attackers, respectively. We explore the multifaceted world of guardrails implementation in LLMs by conducting an empirical analysis to assess the effectiveness of guardrails using prompts. Our results revealed that guardrails deployed in the trusted LLMs could be bypassed using prompt manipulation techniques such as “pretend” and “persist” to generate harmful content. In addition, we also discovered that malicious LLMs still deploy weak guardrails to evade detection by generating human-like content. This empirical analysis provides insights into the design of the malicious and trusted LLMs. We also propose recommendations to defend against prompt manipulation and guardrails bypass while designing LLMs.\",\"PeriodicalId\":55016,\"journal\":{\"name\":\"IEEE Technology and Society Magazine\",\"volume\":\"44 3\",\"pages\":\"98-108\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2025-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Technology and Society Magazine\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11091436/\",\"RegionNum\":4,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Technology and Society Magazine","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11091436/","RegionNum":4,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

摘要

可信的大型语言模型（llm）继承了道德准则，以防止产生有害的内容，而恶意的llm被设计成能够产生不道德和有毒的响应。可信的和恶意的llm分别根据开发人员和攻击者的需求在不同的上下文中使用护栏。我们通过进行实证分析来评估使用提示的护栏的有效性，探索了法学硕士护栏实施的多方面世界。我们的研究结果表明，部署在可信llm中的护栏可以通过使用“假装”和“坚持”等即时操纵技术来绕过，从而产生有害内容。此外，我们还发现恶意llm仍然部署薄弱的护栏，通过生成类似人类的内容来逃避检测。这一实证分析为恶意和可信llm的设计提供了见解。我们还提出了在设计llm时防止即时操纵和护栏绕过的建议。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

The Unprecedented Surge in Generative AI: Empirical Analysis of Trusted and Malicious Large Language Models (LLMs)

Trusted large language models (LLMs) inherit ethical guidelines to prevent generating harmful content, whereas malicious LLMs are engineered to enable the generation of unethical and toxic responses. Both trusted and malicious LLMs use guardrails in differential contexts per the requirements of the developers and attackers, respectively. We explore the multifaceted world of guardrails implementation in LLMs by conducting an empirical analysis to assess the effectiveness of guardrails using prompts. Our results revealed that guardrails deployed in the trusted LLMs could be bypassed using prompt manipulation techniques such as “pretend” and “persist” to generate harmful content. In addition, we also discovered that malicious LLMs still deploy weak guardrails to evade detection by generating human-like content. This empirical analysis provides insights into the design of the malicious and trusted LLMs. We also propose recommendations to defend against prompt manipulation and guardrails bypass while designing LLMs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Technology and Society Magazine 工程技术-工程：电子与电气

CiteScore

3.00

自引率

13.60%

发文量

审稿时长

>12 weeks

期刊介绍： IEEE Technology and Society Magazine invites feature articles (refereed), special articles, and commentaries on topics within the scope of the IEEE Society on Social Implications of Technology, in the broad areas of social implications of electrotechnology, history of electrotechnology, and engineering ethics.