{"title":"Diagnostic accuracy of GPT-4 on common clinical scenarios and challenging cases","authors":"Geoffrey W. Rutledge","doi":"10.1002/lrh2.10438","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Introduction</h3>\n \n <p>Large language models (LLMs) have a high diagnostic accuracy when they evaluate previously published clinical cases.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with the diagnostic accuracy for previously published cases.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>For a set of previously unpublished challenging clinical cases, GPT-4 achieved 61.1% correct in its top 6 diagnoses versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time versus the previously reported 84.3% for physicians.</p>\n </section>\n \n <section>\n \n <h3> Conclusions</h3>\n \n <p>GPT-4 performs at a level at least as good as, if not better than, that of experienced physicians on highly challenging cases in internal medicine. The extraordinary performance of GPT-4 on diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.</p>\n </section>\n </div>","PeriodicalId":43916,"journal":{"name":"Learning Health Systems","volume":"8 3","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/lrh2.10438","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Learning Health Systems","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/lrh2.10438","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH POLICY & SERVICES","Score":null,"Total":0}
Abstract
Introduction
Large language models (LLMs) show high diagnostic accuracy when they evaluate previously published clinical cases.
Methods
We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with the diagnostic accuracy for previously published cases.
Results
For a set of previously unpublished challenging clinical cases, GPT-4 included the correct diagnosis among its top 6 diagnoses in 61.1% of cases, versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes describing more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time, versus the previously reported 84.3% for physicians.
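To illustrate how top-k figures of this kind can be computed, here is a minimal Python sketch. The case data, the exact-string matching rule, and all names below are hypothetical illustrations, not the study's actual data or grading procedure (the study would have relied on clinical judgment to decide whether a listed diagnosis counts as correct).

```python
# Hypothetical sketch: top-k diagnostic accuracy over a set of graded cases.
# Each case pairs the reference (correct) diagnosis with the model's ranked
# differential diagnosis list.

def top_k_accuracy(cases, k):
    """Fraction of cases whose correct diagnosis appears in the top k of the ranked list."""
    hits = sum(1 for correct, ranked in cases if correct in ranked[:k])
    return hits / len(cases)

# Toy example with made-up cases (not data from the paper):
cases = [
    ("pulmonary embolism", ["pneumonia", "pulmonary embolism", "heart failure"]),
    ("appendicitis", ["appendicitis", "gastroenteritis", "ovarian torsion"]),
    ("giant cell arteritis", ["migraine", "tension headache", "sinusitis"]),
]

print(f"Top-3 accuracy: {top_k_accuracy(cases, 3):.1%}")  # 66.7% on this toy set
```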
Conclusions
GPT-4 performs at least as well as, if not better than, experienced physicians on highly challenging cases in internal medicine. The extraordinary performance of GPT-4 in diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.