Variation in Large Language Model Recommendations in Challenging Inpatient Management Scenarios
Susan Landon, Thomas Savage, S Ryan Greysen, Eric Bressman
Journal of General Internal Medicine. Published 2025-10-07. DOI: 10.1007/s11606-025-09888-7
Abstract
Importance: Large language models (LLMs) are entering clinical workflows, yet their behavior in routine bedside decisions that lack a single "correct" recommendation remains unclear.
Objective: To describe variation within and across commercially available LLMs when confronted with common, judgment-dependent inpatient medicine management scenarios.
Design: Cross-sectional simulation study. Four brief vignettes requiring a binary management decision were posed to each model in five independent sessions. Six LLMs were queried: five general-purpose (GPT-4o, GPT-o1, Claude 3.7 Sonnet, Grok 3, and Gemini 2.0 Flash) and one domain-specific (OpenEvidence).
Exposures: Standardized prompts describing (1) transfusion at borderline hemoglobin, (2) resumption of anticoagulation after gastrointestinal bleed, (3) discharge readiness despite a modest creatinine rise, and (4) peri-procedural bridging in a high-risk patient on apixaban.
Main measures: Primary outcomes were each model's overall recommendation (majority across five runs) and its internal consistency (proportion of identical recommendations across runs; range 0-1). Inter-model agreement was the proportion of models giving the same recommendation.
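The measures described above can be sketched in a few lines of Python. This is only an illustration, not the authors' analysis code: the function names, the model labels, and the run data are hypothetical, and internal consistency is assumed here to mean the share of runs that match the majority recommendation.

```python
from collections import Counter


def majority_and_consistency(runs):
    """Return (majority recommendation, share of runs that match it).

    Assumption: "internal consistency" is read as the fraction of runs
    agreeing with the most common answer. With five runs and a binary
    decision there can be no tie.
    """
    counts = Counter(runs)
    recommendation, n_matching = counts.most_common(1)[0]
    return recommendation, n_matching / len(runs)


def inter_model_agreement(overall):
    """Map each recommendation to the proportion of models giving it."""
    counts = Counter(overall.values())
    return {rec: n / len(overall) for rec, n in counts.items()}


# Hypothetical runs for one vignette; these are NOT data from the study.
runs_by_model = {
    "model_a": ["transfuse", "transfuse", "observe", "transfuse", "observe"],
    "model_b": ["observe"] * 5,
    "model_c": ["transfuse"] * 5,
}

summary = {m: majority_and_consistency(r) for m, r in runs_by_model.items()}
overall = {m: rec for m, (rec, _) in summary.items()}
agreement = inter_model_agreement(overall)

for model, (rec, consistency) in summary.items():
    print(f"{model}: {rec} (internal consistency {consistency:.2f})")
print(agreement)
```

In this made-up example, `model_a` flips its answer in two of five runs, which yields an internal consistency of 0.60, the lowest value a five-run binary majority can take.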
Results: A total of 120 model-vignette interactions were analyzed. Inter-model recommendations diverged in every scenario: transfuse vs observe (67% vs 33% of models), restart vs hold anticoagulation (50% vs 50%), discharge vs delay (50% vs 50%), and bridge vs no-bridge (17% vs 83%). Across five repeated queries of the same vignette, some models changed recommendations in two of five runs (internal consistency as low as 0.60). OpenEvidence was the most internally consistent and concrete in its recommendations; every other model displayed internal variability in one or more vignettes.
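The reported figures are mutually consistent under simple arithmetic, which the short check below makes explicit. The per-scenario model counts are an inference from the percentages (assuming each percentage is a share of the six models), not numbers stated in the abstract, and the variable names are illustrative.

```python
n_models, n_vignettes, n_runs = 6, 4, 5

# Six models, four vignettes, five sessions each.
total_interactions = n_models * n_vignettes * n_runs

# Assumed model counts behind each reported split (inferred, not reported).
splits = {
    "transfuse vs observe": (4, 2),
    "restart vs hold anticoagulation": (3, 3),
    "discharge vs delay": (3, 3),
    "bridge vs no-bridge": (1, 5),
}

# Convert counts to whole-number percentages of the six models.
percentages = {
    name: tuple(round(100 * k / n_models) for k in pair)
    for name, pair in splits.items()
}

# A model that changes its answer in two of five runs agrees with its
# majority recommendation in the remaining three.
lowest_consistency = (n_runs - 2) / n_runs

print(total_interactions)
for name, pct in percentages.items():
    print(f"{name}: {pct[0]}% vs {pct[1]}%")
print(lowest_consistency)
```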
Conclusions: For nuanced inpatient management questions, widely used LLMs produced inter- and intra-model variation in their recommendations. Clinicians should view LLM output as one perspective among many, consider sampling multiple models or re-prompting, and retain final responsibility for bedside decisions. Prospective studies are needed to test designs that surface model uncertainty and support safe integration of generative AI into complex decision-making.
About the journal:
The Journal of General Internal Medicine is the official journal of the Society of General Internal Medicine. It promotes improved patient care, research, and education in primary care, general internal medicine, and hospital medicine. Its articles focus on topics such as clinical medicine, epidemiology, prevention, health care delivery, curriculum development, and numerous other non-traditional themes, in addition to classic clinical research on problems in internal medicine.