Dorian Garin, Stéphane Cook, Charlie Ferry, Wesley Bennar, Mario Togni, Pascal Meier, Peter Wenaweser, Serban Puricel, Diego Arroyo
{"title":"Improving large language models accuracy for aortic stenosis treatment via Heart Team simulation: a prompt design analysis.","authors":"Dorian Garin, Stéphane Cook, Charlie Ferry, Wesley Bennar, Mario Togni, Pascal Meier, Peter Wenaweser, Serban Puricel, Diego Arroyo","doi":"10.1093/ehjdh/ztaf068","DOIUrl":null,"url":null,"abstract":"<p><strong>Aims: </strong>Large language models (LLMs) have shown potential in clinical decision support, but the influence of prompt design on their performance, particularly in complex cardiology decision-making, is not well understood.</p><p><strong>Methods and results: </strong>We retrospectively reviewed 231 patients evaluated by our Heart Team for severe aortic stenosis, with treatment options including surgical aortic valve replacement, transcatheter aortic valve implantation, or medical therapy. We tested multiple prompt-design strategies using zero-shot (0-shot), Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting, combined with few-shot prompting, free/guided-thinking, and self-consistency. Patient data were condensed into standardized vignettes and queried using GPT4-o (version 2024-05-13, OpenAI) 40 times per patient under each prompt (147 840 total queries). Primary endpoint was mean accuracy; secondary endpoints included sensitivity, specificity, area under the curve (AUC), and treatment invasiveness. Guided-thinking-ToT achieved the highest accuracy (94.04%, 95% CI 90.87-97.21), significantly outperforming few-shot-ToT (87.16%, 95% CI 82.68-91.63) and few-shot-CoT (85.32%, 95% CI 80.59-90.06; <i>P</i> < 0.0001). Zero-shot prompting showed the lowest accuracy (73.39%, 95% CI 67.48-79.31). Guided-thinking-ToT yielded the highest AUC values (up to 0.97) and was the only prompt whose invasiveness did not differ significantly from Heart Team decisions (<i>P</i> = 0.078). An inverted quadratic relationship emerged between few-shot examples and accuracy, with nine examples optimal (<i>P</i> < 0.0001). 
Self-consistency improved overall accuracy, particularly for ToT-derived prompts (<i>P</i> < 0.001).</p><p><strong>Conclusion: </strong>Prompt design significantly impacts LLM performance in clinical decision-making for severe aortic stenosis. Tree-of-Thought prompting markedly improved accuracy and aligned recommendations with expert decisions, though LLMs tended toward conservative treatment approaches.</p>","PeriodicalId":72965,"journal":{"name":"European heart journal. Digital health","volume":"6 4","pages":"665-674"},"PeriodicalIF":4.4000,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12282391/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European heart journal. Digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/ehjdh/ztaf068","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"CARDIAC & CARDIOVASCULAR SYSTEMS","Score":null,"Total":0}
Abstract
Aims: Large language models (LLMs) have shown potential in clinical decision support, but the influence of prompt design on their performance, particularly in complex cardiology decision-making, is not well understood.
Methods and results: We retrospectively reviewed 231 patients evaluated by our Heart Team for severe aortic stenosis, with treatment options including surgical aortic valve replacement, transcatheter aortic valve implantation, or medical therapy. We tested multiple prompt-design strategies using zero-shot (0-shot), Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting, combined with few-shot prompting, free/guided-thinking, and self-consistency. Patient data were condensed into standardized vignettes and queried using GPT-4o (version 2024-05-13, OpenAI) 40 times per patient under each prompt (147 840 total queries). The primary endpoint was mean accuracy; secondary endpoints included sensitivity, specificity, area under the curve (AUC), and treatment invasiveness. Guided-thinking-ToT achieved the highest accuracy (94.04%, 95% CI 90.87-97.21), significantly outperforming few-shot-ToT (87.16%, 95% CI 82.68-91.63) and few-shot-CoT (85.32%, 95% CI 80.59-90.06; P < 0.0001). Zero-shot prompting showed the lowest accuracy (73.39%, 95% CI 67.48-79.31). Guided-thinking-ToT yielded the highest AUC values (up to 0.97) and was the only prompt whose invasiveness did not differ significantly from Heart Team decisions (P = 0.078). An inverted quadratic relationship emerged between the number of few-shot examples and accuracy, with nine examples optimal (P < 0.0001). Self-consistency improved overall accuracy, particularly for ToT-derived prompts (P < 0.001).
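The self-consistency step described above (40 repeated queries per patient, aggregated into one recommendation) can be sketched as a majority vote over sampled model outputs. This is a minimal illustration, not the authors' code: the `query_llm` callable and the mock model below are hypothetical stand-ins for the actual GPT-4o calls and prompt templates used in the study.

```python
import random
from collections import Counter
from typing import Callable

# The three treatment options evaluated by the Heart Team.
TREATMENTS = ["SAVR", "TAVI", "medical therapy"]

def self_consistent_decision(query_llm: Callable[[str], str],
                             vignette: str,
                             n_samples: int = 40) -> str:
    """Query the model repeatedly on the same patient vignette and
    return the majority-vote recommendation (self-consistency)."""
    votes = Counter(query_llm(vignette) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Hypothetical mock model for illustration only: recommends TAVI
# most often, mimicking a noisy but mostly consistent LLM.
random.seed(0)
def mock_llm(vignette: str) -> str:
    return random.choices(TREATMENTS, weights=[0.2, 0.7, 0.1])[0]

decision = self_consistent_decision(
    mock_llm, "82-year-old with severe symptomatic aortic stenosis", 40)
print(decision)
```

Aggregating many stochastic samples this way smooths out single-query variability, which is consistent with the reported finding that self-consistency improved accuracy, especially for ToT-derived prompts.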
Conclusion: Prompt design significantly impacts LLM performance in clinical decision-making for severe aortic stenosis. Tree-of-Thought prompting markedly improved accuracy and aligned recommendations with expert decisions, though LLMs tended toward conservative treatment approaches.