M. Omar, Benjamin S. Glicksberg, G. Nadkarni, E. Klang
{"title":"过于自信的人工智能?临床场景中的法律硕士自我评估基准","authors":"M. Omar, Benjamin S. Glicksberg, G. Nadkarni, E. Klang","doi":"10.1101/2024.08.11.24311810","DOIUrl":null,"url":null,"abstract":"Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess their ability to accurately judge their responses. Methods: We used 1965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and confidence scores. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated. Results: All models displayed high confidence regardless of answer correctness. Higher-tier models showed slightly better calibration, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared to lower-tier models (79.6% vs 79.5%). The mean confidence difference between correct and incorrect responses ranged from 0.6% to 5.4% across all models. Four models showed significantly higher confidence when correct (p<0.01), but the difference remained small. Most models demonstrated consistency across question versions. Conclusion: While newer LLMs show improved performance and consistency in medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation. Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical knowledge.","PeriodicalId":18505,"journal":{"name":"medRxiv","volume":"16 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Overconfident AI? Benchmarking LLM Self-Assessment in Clinical Scenarios\",\"authors\":\"M. Omar, Benjamin S. Glicksberg, G. Nadkarni, E. Klang\",\"doi\":\"10.1101/2024.08.11.24311810\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess their ability to accurately judge their responses. Methods: We used 1965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and confidence scores. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated. Results: All models displayed high confidence regardless of answer correctness. Higher-tier models showed slightly better calibration, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared to lower-tier models (79.6% vs 79.5%). The mean confidence difference between correct and incorrect responses ranged from 0.6% to 5.4% across all models. Four models showed significantly higher confidence when correct (p<0.01), but the difference remained small. 
Most models demonstrated consistency across question versions. Conclusion: While newer LLMs show improved performance and consistency in medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation. Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical knowledge.\",\"PeriodicalId\":18505,\"journal\":{\"name\":\"medRxiv\",\"volume\":\"16 4\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"medRxiv\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2024.08.11.24311810\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.08.11.24311810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Overconfident AI? Benchmarking LLM Self-Assessment in Clinical Scenarios
Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess how accurately they judge their own responses.

Methods: We used 1,965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide an answer and a confidence score for each question. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated.

Results: All models displayed high confidence regardless of whether their answers were correct. Higher-tier models were slightly better calibrated, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared with lower-tier models (79.6% vs. 79.5%). Across all models, the mean confidence difference between correct and incorrect responses ranged from 0.6 to 5.4 percentage points. Four models showed significantly higher confidence when correct (p < 0.01), but the difference remained small. Most models were consistent across question versions.

Conclusion: While newer LLMs show improved performance and consistency on medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their own certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation.

Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical Knowledge
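As a rough illustration of the calibration analysis described in the Methods and Results, the sketch below compares mean model-reported confidence on correct versus incorrect answers with Welch's t-test, and compares accuracy across models with a chi-square test. The input file, column names, and per-question record layout are assumptions made for illustration; they are not the authors' actual pipeline.

```python
# Minimal sketch of a confidence-calibration comparison for LLM benchmark results.
# Assumes one row per (model, question) with a 0/1 correctness flag and a
# model-reported confidence score on a 0-100 scale (hypothetical file/columns).
import pandas as pd
from scipy import stats

df = pd.read_csv("llm_responses.csv")  # columns: model, question_id, correct, confidence

for model, grp in df.groupby("model"):
    conf_correct = grp.loc[grp["correct"] == 1, "confidence"]
    conf_incorrect = grp.loc[grp["correct"] == 0, "confidence"]

    # Welch's t-test: is mean confidence higher when the model answers correctly?
    t_stat, p_val = stats.ttest_ind(conf_correct, conf_incorrect, equal_var=False)

    print(
        f"{model}: accuracy={grp['correct'].mean():.1%}, "
        f"mean conf correct={conf_correct.mean():.1f}%, "
        f"mean conf incorrect={conf_incorrect.mean():.1f}%, "
        f"gap={conf_correct.mean() - conf_incorrect.mean():.1f} pts, p={p_val:.3g}"
    )

# Chi-square test on the model-by-correctness contingency table:
# do accuracy rates differ significantly across models?
contingency = pd.crosstab(df["model"], df["correct"])
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print(f"chi-square across models: chi2={chi2:.1f}, dof={dof}, p={p:.3g}")
```

In this kind of setup, a well-calibrated model would show a large confidence gap between correct and incorrect answers; the study's reported gaps of only 0.6 to 5.4 percentage points are what the abstract characterizes as poor calibration.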