Ethical implications of using general-purpose LLMs in clinical settings: a comparative analysis of prompt engineering strategies and their impact on patient safety.
{"title":"Ethical implications of using general-purpose LLMs in clinical settings: a comparative analysis of prompt engineering strategies and their impact on patient safety.","authors":"Pouyan Esmaeilzadeh","doi":"10.1186/s12911-025-03182-6","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The rapid integration of large language models (LLMs) into healthcare raises critical ethical concerns regarding patient safety, reliability, transparency, and equitable care delivery. Despite not being trained explicitly on medical data, individuals increasingly use general-purpose LLMs to address medical questions and clinical scenarios. While prompt engineering can optimize LLM performance, its ethical implications for clinical decision-making remain underexplored. This study aimed to evaluate the ethical dimensions of prompt engineering strategies in the clinical applications of LLMs, focusing on safety, bias, transparency, and their implications for the responsible implementation of AI in healthcare.</p><p><strong>Methods: </strong>We conducted an ethics-focused analysis of three advanced and reasoning-capable LLMs (OpenAI O3, Claude Sonnet 4, Google Gemini 2.5 Pro) across six prompt engineering strategies and five clinical scenarios of varying ethical complexity. Six expert clinicians evaluated 90 responses using domains that included diagnostic accuracy, safety assessment, communication, empathy, and ethical reasoning. We specifically analyzed safety incidents, bias patterns, and transparency of reasoning processes.</p><p><strong>Results: </strong>Significant ethical concerns emerged across all models and scenarios. Critical safety issues occurred in 12.2% of responses, with concentration in complex ethical scenarios (Level 5: 23.1% vs. Level 1: 2.3%, p < 0.001). Meta-cognitive prompting demonstrated superior ethical reasoning (mean ethics score: 78.3 ± 9.1), while safety-first prompting reduced safety incidents by 45% compared to zero-shot approaches (8.9% vs. 16.2%). However, all models showed concerning deficits in communication empathy (mean 54% of maximum) and exhibited potential bias in complex multi-cultural scenarios. Transparency varied significantly by prompt strategy, with meta-cognitive approaches providing the clearest reasoning pathways (4.2 vs. 1.8 explicit reasoning steps), which are essential for clinical accountability. The study highlighted critical gaps in ethical decision-making transparency, with meta-cognitive approaches providing 4.2 explicit reasoning steps compared to 1.8 in zero-shot methods (p < 0.001). Bias patterns disproportionately affected vulnerable populations, with systematic underestimation of treatment appropriateness in elderly patients and inadequate cultural considerations in end-of-life scenarios.</p><p><strong>Conclusions: </strong>Current clinical applications of general-purpose LLMs present substantial ethical challenges requiring urgent attention. While structured prompt engineering demonstrated measurable improvements in some domains, with meta-cognitive approaches showing 13.0% performance gains and safety-first prompting reducing critical incidents by 45%, substantial limitations persist across all strategies. 
Even optimized approaches achieved inadequate performance in communication and empathy (≤ 54% of maximum), retained residual bias patterns (11.7% in safety-first conditions), and exhibited concerning safety deficits, indicating that current prompt engineering methods provide only marginal improvements, which are insufficient for reliable clinical deployment. These findings highlight significant ethical challenges that necessitate further investigation into the development of appropriate guidelines and regulatory frameworks for the clinical use of general-purpose AI models.</p>","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":"25 1","pages":"342"},"PeriodicalIF":3.8000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12481957/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-025-03182-6","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Abstract
Background: The rapid integration of large language models (LLMs) into healthcare raises critical ethical concerns regarding patient safety, reliability, transparency, and equitable care delivery. Although general-purpose LLMs are not explicitly trained on medical data, individuals increasingly use them to address medical questions and clinical scenarios. While prompt engineering can optimize LLM performance, its ethical implications for clinical decision-making remain underexplored. This study evaluated the ethical dimensions of prompt engineering strategies in clinical applications of LLMs, focusing on safety, bias, and transparency, and on their implications for the responsible implementation of AI in healthcare.
Methods: We conducted an ethics-focused analysis of three advanced, reasoning-capable LLMs (OpenAI O3, Claude Sonnet 4, Google Gemini 2.5 Pro) across six prompt engineering strategies and five clinical scenarios of varying ethical complexity. Six expert clinicians evaluated the resulting 90 responses across domains including diagnostic accuracy, safety assessment, communication, empathy, and ethical reasoning. We specifically analyzed safety incidents, bias patterns, and transparency of reasoning processes. The sketch after this paragraph illustrates what three of the compared strategies might look like in practice.
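To make the strategy comparison concrete, here is a minimal Python sketch of zero-shot, safety-first, and meta-cognitive prompt templates. The abstract does not reproduce the study's actual prompts, so the vignette, template wording, and function names below are hypothetical illustrations, not the authors' materials.

```python
# Illustrative prompt templates for three of the six strategies compared in
# the study. The exact wording used by the authors is not published in the
# abstract; these are hypothetical reconstructions for demonstration only.

CLINICAL_VIGNETTE = (
    "An 82-year-old patient with advanced heart failure asks whether "
    "they should stop taking their diuretic because of fatigue."
)  # hypothetical scenario, not from the study

def zero_shot(vignette: str) -> str:
    # Baseline: the scenario is presented with no additional structure.
    return vignette

def safety_first(vignette: str) -> str:
    # Prepends explicit safety constraints before the clinical question.
    return (
        "You must prioritize patient safety. Flag any answer that could "
        "cause harm, state when a clinician must be consulted, and avoid "
        "definitive medical advice where evidence is uncertain.\n\n"
        + vignette
    )

def meta_cognitive(vignette: str) -> str:
    # Asks the model to expose and self-check its reasoning steps, the
    # property the study links to more transparent reasoning pathways.
    return (
        vignette
        + "\n\nBefore answering: (1) list the clinical and ethical "
        "considerations step by step, (2) identify what information is "
        "missing, (3) state your uncertainty, then (4) give a response "
        "and explain how each step supports it."
    )

if __name__ == "__main__":
    for strategy in (zero_shot, safety_first, meta_cognitive):
        print(f"--- {strategy.__name__} ---")
        print(strategy(CLINICAL_VIGNETTE), end="\n\n")
```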
Results: Significant ethical concerns emerged across all models and scenarios. Critical safety issues occurred in 12.2% of responses, concentrated in complex ethical scenarios (Level 5: 23.1% vs. Level 1: 2.3%, p < 0.001). Meta-cognitive prompting demonstrated superior ethical reasoning (mean ethics score: 78.3 ± 9.1), while safety-first prompting reduced safety incidents by 45% compared to zero-shot approaches (8.9% vs. 16.2%). However, all models showed concerning deficits in communication empathy (mean 54% of maximum) and exhibited potential bias in complex multicultural scenarios. Transparency, which is essential for clinical accountability, varied significantly by prompt strategy: meta-cognitive approaches provided the clearest reasoning pathways, averaging 4.2 explicit reasoning steps compared with 1.8 for zero-shot methods (p < 0.001). Bias patterns disproportionately affected vulnerable populations, with systematic underestimation of treatment appropriateness in elderly patients and inadequate cultural considerations in end-of-life scenarios.
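As a quick check, the headline 45% figure is consistent with reading the two incident rates as a relative risk reduction:

$$\text{RRR} = \frac{r_{\text{zero-shot}} - r_{\text{safety-first}}}{r_{\text{zero-shot}}} = \frac{16.2\% - 8.9\%}{16.2\%} \approx 0.451 \approx 45\%$$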
Conclusions: Current clinical applications of general-purpose LLMs present substantial ethical challenges requiring urgent attention. While structured prompt engineering demonstrated measurable improvements in some domains, with meta-cognitive approaches showing 13.0% performance gains and safety-first prompting reducing critical incidents by 45%, substantial limitations persist across all strategies. Even optimized approaches achieved inadequate performance in communication and empathy (≤ 54% of maximum), retained residual bias patterns (11.7% in safety-first conditions), and exhibited concerning safety deficits, indicating that current prompt engineering methods provide only marginal improvements, insufficient for reliable clinical deployment. These findings underscore the need for appropriate guidelines and regulatory frameworks governing the clinical use of general-purpose AI models.
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.