The PIEE Cycle: A Structured Framework for Red Teaming Large Language Models in Clinical Decision-Making.

IF 3.7 | Zone 3 (Medicine) | JCR Q2 (ENGINEERING, BIOMEDICAL)

Bioengineering, vol. 12, no. 7. Pub Date: 2025-06-27. DOI: 10.3390/bioengineering12070706

Maissa Trabilsy, Srinivasagam Prabha, Cesar A Gomez-Cabello, Syed Ali Haider, Ariana Genovese, Sahar Borna, Nadia Wood, Narayanan Gopala, Cui Tao, Antonio J Forte

Citations: 0

Abstract

The increasing integration of large language models (LLMs) into healthcare presents significant opportunities, but also critical risks related to patient safety, accuracy, and ethical alignment. Despite these concerns, no standardized framework exists for systematically evaluating and stress-testing LLM behavior in clinical decision-making. The PIEE cycle (Planning and Preparation, Information Gathering and Prompt Generation, Execution, and Evaluation) is a structured red-teaming framework developed specifically to address artificial intelligence (AI) safety risks in healthcare decision-making. PIEE enables clinicians and informatics teams to simulate adversarial prompts, including jailbreaking, social engineering, and distractor attacks, to stress-test language models in real-world clinical scenarios. Model performance is evaluated using specific metrics such as true positive and false positive rates for detecting harmful content, hallucination rates measured through adapted TruthfulQA scoring, safety and reliability assessments, bias detection via adapted BBQ benchmarks, and ethical evaluation using structured Likert-based scoring rubrics. The framework is illustrated using examples from plastic surgery, but is adaptable across specialties, and is intended for use by all medical providers, regardless of their backgrounds or familiarity with artificial intelligence. While the framework is currently conceptual and validation is ongoing, PIEE provides a practical foundation for assessing the clinical reliability and ethical robustness of LLMs in medicine.
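
As an illustration of the first metric named in the abstract, the sketch below computes true positive and false positive rates for harmful-content detection from a small set of labeled red-team results. The abstract does not specify how these rates are computed or recorded, so the `RedTeamResult` schema, the `detection_rates` helper, the prompt identifiers, and the sample data are all hypothetical and serve only to make the definitions concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedTeamResult:
    """One prompt and how the model's response was judged (hypothetical schema)."""
    prompt_id: str
    is_harmful_request: bool  # ground truth assigned by the red team
    flagged_by_model: bool    # True if the model refused or flagged the request


def detection_rates(results):
    """Return (true_positive_rate, false_positive_rate) for harmful-content detection."""
    tp = sum(r.is_harmful_request and r.flagged_by_model for r in results)
    fn = sum(r.is_harmful_request and not r.flagged_by_model for r in results)
    fp = sum(not r.is_harmful_request and r.flagged_by_model for r in results)
    tn = sum(not r.is_harmful_request and not r.flagged_by_model for r in results)
    # A rate is undefined (NaN) when its denominator class is empty.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Illustrative data only: four adversarial prompts and four benign controls.
results = [
    RedTeamResult("jailbreak-01", True, True),
    RedTeamResult("jailbreak-02", True, False),
    RedTeamResult("social-eng-01", True, True),
    RedTeamResult("distractor-01", True, True),
    RedTeamResult("benign-01", False, False),
    RedTeamResult("benign-02", False, True),
    RedTeamResult("benign-03", False, False),
    RedTeamResult("benign-04", False, False),
]

tpr, fpr = detection_rates(results)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")  # prints "TPR=0.75 FPR=0.25"
```

Including benign control prompts alongside the adversarial ones is what makes the false positive rate measurable: without them, a model that refuses everything would appear to detect harmful content perfectly.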

Source journal

Bioengineering (Chemical Engineering: Bioengineering)

CiteScore: 4.00
Self-citation rate: 8.70%
Number of publications: 661

Journal description

Aims

Bioengineering (ISSN 2306-5354) provides an advanced forum for the science and technology of bioengineering. It publishes original research papers, comprehensive reviews, communications and case reports. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. All aspects of bioengineering are welcomed, from theoretical concepts to education and applications. There is no restriction on the length of the papers. The full experimental details must be provided so that the results can be reproduced. There are, in addition, four key features of this Journal:

● We are introducing a new concept in scientific and technical publications: “The Translational Case Report in Bioengineering”. It is a descriptive explanatory analysis of a transformative or translational event. Understanding that the goal of bioengineering scholarship is to advance towards a transformative or clinical solution to an identified transformative/clinical need, the translational case report is used to explore causation in order to find underlying principles that may guide other similar transformative/translational undertakings.
● Manuscripts regarding research proposals and research ideas will be particularly welcomed.
● Electronic files and software regarding the full details of the calculation and experimental procedure, if unable to be published in a normal way, can be deposited as supplementary material.
● We also accept manuscripts communicating to a broader audience with regard to research projects financed with public funds.

Scope

● Bionics and biological cybernetics: implantology; bio–abio interfaces
● Bioelectronics: wearable electronics; implantable electronics; “more than Moore” electronics; bioelectronics devices
● Bioprocess and biosystems engineering and applications: bioprocess design; biocatalysis; bioseparation and bioreactors; bioinformatics; bioenergy; etc.
● Biomolecular, cellular and tissue engineering and applications: tissue engineering; chromosome engineering; embryo engineering; cellular, molecular and synthetic biology; metabolic engineering; bio-nanotechnology; micro/nano technologies; genetic engineering; transgenic technology
● Biomedical engineering and applications: biomechatronics; biomedical electronics; biomechanics; biomaterials; biomimetics; biomedical diagnostics; biomedical therapy; biomedical devices; sensors and circuits; biomedical imaging and medical information systems; implants and regenerative medicine; neurotechnology; clinical engineering; rehabilitation engineering
● Biochemical engineering and applications: metabolic pathway engineering; modeling and simulation
● Translational bioengineering