Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support.

Communications Medicine · IF 5.4 · Q1 (Medicine, Research & Experimental)
Mahmud Omar, Vera Sorin, Jeremy D Collins, David Reich, Robert Freeman, Nicholas Gavin, Alexander Charney, Lisa Stump, Nicola Luigi Bragazzi, Girish N Nadkarni, Eyal Klang
{"title":"Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support.","authors":"Mahmud Omar, Vera Sorin, Jeremy D Collins, David Reich, Robert Freeman, Nicholas Gavin, Alexander Charney, Lisa Stump, Nicola Luigi Bragazzi, Girish N Nadkarni, Eyal Klang","doi":"10.1038/s43856-025-01021-3","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) show promise in clinical contexts but can generate false facts (often referred to as \"hallucinations\"). One subset of these errors arises from adversarial attacks, in which fabricated details embedded in prompts lead the model to produce or elaborate on the false information. We embedded fabricated content in clinical prompts to elicit adversarial hallucination attacks in multiple large language models. We quantified how often they elaborated on false details and tested whether a specialized mitigation prompt or altered temperature settings reduced errors.</p><p><strong>Methods: </strong>We created 300 physician-validated simulated vignettes, each containing one fabricated detail (a laboratory test, a physical or radiological sign, or a medical condition). Each vignette was presented in short and long versions-differing only in word count but identical in medical content. We tested six LLMs under three conditions: default (standard settings), mitigating prompt (designed to reduce hallucinations), and temperature 0 (deterministic output with maximum response certainty), generating 5,400 outputs. If a model elaborated on the fabricated detail, the case was classified as a \"hallucination\".</p><p><strong>Results: </strong>Hallucination rates range from 50 % to 82 % across models and prompting methods. Prompt-based mitigation lowers the overall hallucination rate (mean across all models) from 66 % to 44 % (p < 0.001). For the best-performing model, GPT-4o, rates decline from 53 % to 23 % (p < 0.001). Temperature adjustments offer no significant improvement. Short vignettes show slightly higher odds of hallucination.</p><p><strong>Conclusions: </strong>LLMs are highly susceptible to adversarial hallucination attacks, frequently generating false clinical details that pose risks when used without safeguards. While prompt engineering reduces errors, it does not eliminate them.</p>","PeriodicalId":72646,"journal":{"name":"Communications medicine","volume":"5 1","pages":"330"},"PeriodicalIF":5.4000,"publicationDate":"2025-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12318031/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communications medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1038/s43856-025-01021-3","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, RESEARCH & EXPERIMENTAL","Score":null,"Total":0}

Abstract

Background: Large language models (LLMs) show promise in clinical contexts but can generate false facts (often referred to as "hallucinations"). One subset of these errors arises from adversarial attacks, in which fabricated details embedded in prompts lead the model to produce or elaborate on the false information. We embedded fabricated content in clinical prompts to elicit adversarial hallucination attacks in multiple large language models. We quantified how often they elaborated on false details and tested whether a specialized mitigation prompt or altered temperature settings reduced errors.

Methods: We created 300 physician-validated simulated vignettes, each containing one fabricated detail (a laboratory test, a physical or radiological sign, or a medical condition). Each vignette was presented in short and long versions, differing only in word count but identical in medical content. We tested six LLMs under three conditions: default (standard settings), mitigating prompt (designed to reduce hallucinations), and temperature 0 (deterministic output with maximum response certainty), generating 5,400 outputs. If a model elaborated on the fabricated detail, the case was classified as a "hallucination".
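For readers who want to picture the experimental setup, the following is a minimal sketch of such an evaluation loop, not the authors' published harness. It assumes a hypothetical vignettes.json file of records with "text" and "fabricated_detail" fields, uses the OpenAI Python SDK with GPT-4o as one example backend (the study tested six models), and the mitigation-prompt wording is a placeholder, not the study's prompt. The keyword check in elaborates_on is a crude stand-in for the study's adjudication of whether a model elaborated on the fabricated detail.

```python
# Illustrative sketch only; not the study's actual code or prompts.
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

MITIGATION_PROMPT = (
    "You are a careful clinical assistant. If the case mentions a test, "
    "sign, or condition you cannot verify or that does not exist, say so "
    "explicitly instead of elaborating on it."  # placeholder wording
)

# The three experimental conditions described in the abstract.
CONDITIONS = {
    "default": {"system": None, "temperature": None},
    "mitigating_prompt": {"system": MITIGATION_PROMPT, "temperature": None},
    "temperature_0": {"system": None, "temperature": 0},
}

def query(vignette_text, condition):
    """Send one vignette to the model under one experimental condition."""
    messages = []
    if condition["system"]:
        messages.append({"role": "system", "content": condition["system"]})
    messages.append({"role": "user", "content": vignette_text})
    kwargs = {"model": "gpt-4o", "messages": messages}
    if condition["temperature"] is not None:
        kwargs["temperature"] = condition["temperature"]
    return client.chat.completions.create(**kwargs).choices[0].message.content

def elaborates_on(output, fabricated_detail):
    """Crude proxy for adjudication: flag outputs that repeat the fabricated detail."""
    return fabricated_detail.lower() in output.lower()

vignettes = json.load(open("vignettes.json"))  # hypothetical input file
results = []
for v in vignettes:
    for name, cond in CONDITIONS.items():
        out = query(v["text"], cond)
        results.append({
            "condition": name,
            "hallucination": elaborates_on(out, v["fabricated_detail"]),
        })
```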

Results: Hallucination rates range from 50% to 82% across models and prompting methods. Prompt-based mitigation lowers the overall hallucination rate (mean across all models) from 66% to 44% (p < 0.001). For the best-performing model, GPT-4o, rates decline from 53% to 23% (p < 0.001). Temperature adjustments offer no significant improvement. Short vignettes show slightly higher odds of hallucination.
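As a back-of-the-envelope check on why a 66% vs. 44% difference clears p < 0.001, a simple two-proportion comparison is shown below. The abstract does not state the paper's statistical model, and the assumed denominator of 1,800 outputs per condition (5,400 outputs / 3 conditions, pooled over models, ignoring clustering by model and vignette) is a simplifying assumption for illustration only.

```python
# Rough illustration, not the paper's analysis.
from scipy.stats import chi2_contingency

n = 1800                           # assumed outputs per condition (see caveat above)
default_hallu = round(0.66 * n)    # ~66% hallucination under default prompting
mitigated_hallu = round(0.44 * n)  # ~44% with the mitigation prompt

table = [
    [default_hallu, n - default_hallu],
    [mitigated_hallu, n - mitigated_hallu],
]
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2e}")  # p is far below 0.001 at this sample size
```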

Conclusions: LLMs are highly susceptible to adversarial hallucination attacks, frequently generating false clinical details that pose risks when used without safeguards. While prompt engineering reduces errors, it does not eliminate them.

