Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators

Nicholas Wan, Qiao Jin, Joey Chan, Guangzhi Xiong, Serina Applebaum, Aidan Gilson, Reid McMurry, R Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu

ArXiv. Published 2025-03-21. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11722524/pdf/
Abstract
Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting medical calculators, remains uncertain. We evaluated medical trainees and LLMs in recommending medical calculators across clinical scenarios such as risk stratification and diagnosis. Specifically, we assessed nine LLMs, including open-source, proprietary, and domain-specific models, on 1,009 multiple-choice question-answer pairs spanning 35 clinical calculators, and compared LLMs to humans on a subset of 100 questions. While the highest-performing LLM, OpenAI o1, achieved an answer accuracy of 66.0% (CI: 56.7-75.3%) on this subset, two human annotators nominally outperformed LLMs with an average answer accuracy of 79.5% (CI: 73.5-85.0%). Error analysis showed that the highest-performing LLMs continue to make mistakes in comprehension (49.3% of errors) and calculator knowledge (7.1% of errors); our findings highlight that LLMs are not superior to humans in calculator recommendation.
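To make the reported metric concrete, the sketch below shows one way to score multiple-choice calculator-recommendation answers and attach a 95% confidence interval to the accuracy, as in the figures quoted above. This is not the authors' code: the percentile-bootstrap CI, the answer-letter format, and the toy data are all assumptions for illustration; the paper may use a different interval procedure.

import random

def accuracy(preds, golds):
    """Fraction of predicted answer choices that exactly match the gold choice."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def bootstrap_ci(preds, golds, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample questions with replacement, re-score,
    and take the (alpha/2, 1 - alpha/2) quantiles of the accuracies."""
    rng = random.Random(seed)
    n = len(golds)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy([preds[i] for i in idx], [golds[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy usage with hypothetical answer letters (100 questions, matching the
# size of the human-evaluated subset; the data here is made up):
golds = ["A", "C", "B", "D"] * 25
preds = ["A", "C", "D", "D"] * 25
acc = accuracy(preds, golds)
lo, hi = bootstrap_ci(preds, golds)
print(f"accuracy = {acc:.1%} (CI: {lo:.1%}-{hi:.1%})")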