Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

IF 3.1, Zone 3 (Medicine), Q2 MEDICAL INFORMATICS

JMIR Medical Informatics Pub Date : 2025-05-16 DOI:10.2196/66917

Mahmud Omar, Reem Agbareia, Benjamin S Glicksberg, Girish N Nadkarni, Eyal Klang

Journal: JMIR Medical Informatics, volume 13, article e66917. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12101789/pdf/

Citations: 0

Abstract


Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study.

Background: The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored.

Objective: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses.

Methods: We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests.
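The two analyses described in Methods can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy numbers are assumptions, only the Python standard library is used, and the pooled-variance form of the two-sample t test is assumed because the abstract does not say which variant was applied. The sketch returns the t statistic and degrees of freedom only; a P value would additionally require the t distribution, for example from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def pooled_t_statistic(a, b):
    """t statistic and degrees of freedom for a two-sample test.

    Assumes the pooled-variance (Student) variant; this choice is an
    assumption, not something stated in the abstract.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Toy per-model values (hypothetical, NOT taken from the study).
toy_accuracy = [0.80, 0.65, 0.50]
toy_confidence_when_correct = [0.60, 0.70, 0.80]
r = pearson_r(toy_confidence_when_correct, toy_accuracy)

# Toy confidence scores for correct vs incorrect answers (hypothetical).
conf_correct = [80, 82, 78, 84]
conf_incorrect = [76, 79, 75, 78]
t, df = pooled_t_statistic(conf_correct, conf_incorrect)

print(f"r = {r:.2f}, t = {t:.2f}, df = {df}")
```

In the toy data confidence falls as accuracy rises, so `r` is negative, which is the direction of the relationship the study reports.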

Results: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model, GPT-4o, had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model, Qwen2-7B, showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003).
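The contrast between the two models quoted in Results can be made concrete with simple arithmetic. The sketch below subtracts mean accuracy from mean confidence for each model; the `confidence_gap` helper and the reading of its sign are illustrative assumptions, not a metric defined in the abstract.

```python
def confidence_gap(mean_confidence, mean_accuracy):
    """Mean confidence minus mean accuracy, in percentage points.

    Positive values suggest overconfidence, negative values underconfidence.
    Illustrative quantity only; not a metric defined by the study.
    """
    return mean_confidence - mean_accuracy


# Values quoted in the Results paragraph (percent).
reported = {
    "GPT-4o": {"accuracy": 74, "confidence": 63},
    "Qwen2-7B": {"accuracy": 46, "confidence": 76},
}

gaps = {
    name: confidence_gap(v["confidence"], v["accuracy"])
    for name, v in reported.items()
}
for name, gap in gaps.items():
    print(f"{name}: {gap:+d} percentage points")
```

On these figures the stronger model reports confidence about 11 points below its accuracy, while the weaker model reports confidence about 30 points above it.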

Conclusions: Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.


Source journal

JMIR Medical Informatics (Medicine, Health Informatics)

CiteScore: 7.90
Self-citation rate: 3.10%
Articles published: 173
Review time: 12 weeks

Journal description: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It emphasizes applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, and industry and health informatics professionals. It is published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175). JMIR Med Inform has a slightly different scope: it places more emphasis on applications for clinicians and health professionals than on consumers and citizens, who are the focus of JMIR. It also publishes even faster and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.