{"title":"生成式人工智能的空前发展:可信和恶意大型语言模型(LLMs)的实证分析","authors":"Aditya K. Sood;Sherali Zeadally","doi":"10.1109/MTS.2025.3582667","DOIUrl":null,"url":null,"abstract":"Trusted large language models (LLMs) inherit ethical guidelines to prevent generating harmful content, whereas malicious LLMs are engineered to enable the generation of unethical and toxic responses. Both trusted and malicious LLMs use guardrails in differential contexts per the requirements of the developers and attackers, respectively. We explore the multifaceted world of guardrails implementation in LLMs by conducting an empirical analysis to assess the effectiveness of guardrails using prompts. Our results revealed that guardrails deployed in the trusted LLMs could be bypassed using prompt manipulation techniques such as “pretend” and “persist” to generate harmful content. In addition, we also discovered that malicious LLMs still deploy weak guardrails to evade detection by generating human-like content. This empirical analysis provides insights into the design of the malicious and trusted LLMs. We also propose recommendations to defend against prompt manipulation and guardrails bypass while designing LLMs.","PeriodicalId":55016,"journal":{"name":"IEEE Technology and Society Magazine","volume":"44 3","pages":"98-108"},"PeriodicalIF":1.9000,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The Unprecedented Surge in Generative AI: Empirical Analysis of Trusted and Malicious Large Language Models (LLMs)\",\"authors\":\"Aditya K. Sood;Sherali Zeadally\",\"doi\":\"10.1109/MTS.2025.3582667\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Trusted large language models (LLMs) inherit ethical guidelines to prevent generating harmful content, whereas malicious LLMs are engineered to enable the generation of unethical and toxic responses. Both trusted and malicious LLMs use guardrails in differential contexts per the requirements of the developers and attackers, respectively. We explore the multifaceted world of guardrails implementation in LLMs by conducting an empirical analysis to assess the effectiveness of guardrails using prompts. Our results revealed that guardrails deployed in the trusted LLMs could be bypassed using prompt manipulation techniques such as “pretend” and “persist” to generate harmful content. In addition, we also discovered that malicious LLMs still deploy weak guardrails to evade detection by generating human-like content. This empirical analysis provides insights into the design of the malicious and trusted LLMs. We also propose recommendations to defend against prompt manipulation and guardrails bypass while designing LLMs.\",\"PeriodicalId\":55016,\"journal\":{\"name\":\"IEEE Technology and Society Magazine\",\"volume\":\"44 3\",\"pages\":\"98-108\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2025-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Technology and Society Magazine\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11091436/\",\"RegionNum\":4,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Technology and Society Magazine","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11091436/","RegionNum":4,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
The Unprecedented Surge in Generative AI: Empirical Analysis of Trusted and Malicious Large Language Models (LLMs)
Trusted large language models (LLMs) inherit ethical guidelines to prevent generating harmful content, whereas malicious LLMs are engineered to enable the generation of unethical and toxic responses. Both trusted and malicious LLMs use guardrails in differential contexts per the requirements of the developers and attackers, respectively. We explore the multifaceted world of guardrails implementation in LLMs by conducting an empirical analysis to assess the effectiveness of guardrails using prompts. Our results revealed that guardrails deployed in the trusted LLMs could be bypassed using prompt manipulation techniques such as “pretend” and “persist” to generate harmful content. In addition, we also discovered that malicious LLMs still deploy weak guardrails to evade detection by generating human-like content. This empirical analysis provides insights into the design of the malicious and trusted LLMs. We also propose recommendations to defend against prompt manipulation and guardrails bypass while designing LLMs.
期刊介绍:
IEEE Technology and Society Magazine invites feature articles (refereed), special articles, and commentaries on topics within the scope of the IEEE Society on Social Implications of Technology, in the broad areas of social implications of electrotechnology, history of electrotechnology, and engineering ethics.