The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study

IF 23.8; Region 1 (Medicine); JCR Q1 (Medical Informatics)

Lancet Digital Health. Pub Date: 2024-08-01. DOI: 10.1016/S2589-7500(24)00097-9

David M Levine MD, Rudraksh Tuwani BS, Benjamin Kompa MPhil, Amita Varma BS, Samuel G Finlayson MD PhD, Prof Ateev Mehrotra MD, Andrew Beam PhD

Lancet Digital Health, volume 6, issue 8, pages e555-e561 (Journal Article). Full text: https://www.sciencedirect.com/science/article/pii/S2589750024000979

Citations: 0

Abstract

Background

Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labelled data, making deployment and generalisability challenging. How well a general-purpose AI language model performs diagnosis and triage relative to physicians and laypeople is not well understood.

Methods

We compared the diagnostic and triage accuracy of Generative Pre-trained Transformer 3 (GPT-3) with that of a nationally representative sample of 5000 lay people from the USA, who could use the internet to find the correct options, and 21 practising physicians at Harvard Medical School. The comparison used 48 validated synthetic case vignettes (<50 words; sixth-grade reading level or below) of both common (eg, viral illness) and severe (eg, heart attack) conditions. There were 12 vignettes for each of four triage categories: emergent, within 1 day, within 1 week, and self-care. The correct diagnosis and triage category (ie, ground truth) for each vignette was determined by two general internists at Harvard Medical School. For each vignette, human respondents and GPT-3 were prompted to list diagnoses in order of likelihood, and the vignette was marked as correct if the ground-truth diagnosis was in the top three of the listed diagnoses. For triage accuracy, we examined whether the triage selected by the human respondents and by GPT-3 was exactly correct according to the four triage categories, or matched a dichotomised triage variable (emergent or within 1 day vs within 1 week or self-care). We estimated GPT-3's diagnostic and triage confidence on a given vignette using a modified bootstrap resampling procedure, and examined how well calibrated GPT-3's confidence was by computing calibration curves and Brier scores. We also performed a subgroup analysis by case acuity, and an error analysis of triage advice to characterise how GPT-3's advice might affect patients using this tool to decide whether they should seek medical care immediately.
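The scoring rules and the Brier score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the diagnosis match is a simple case-insensitive string comparison (the abstract does not say how matches were adjudicated), and the Brier score is the standard mean squared difference between a stated confidence and a 0/1 outcome. The modified bootstrap procedure used to obtain GPT-3's confidence is not specified in the abstract and is not reproduced here.

```python
from typing import Sequence

# The four triage categories named in the abstract; the first two form the
# "urgent" side of the dichotomised variable.
TRIAGE_LEVELS = ("emergent", "within 1 day", "within 1 week", "self-care")
URGENT = {"emergent", "within 1 day"}


def top3_correct(ranked_diagnoses: Sequence[str], truth: str) -> bool:
    """True if the ground-truth diagnosis is among the first three listed.

    Assumption: exact, case-insensitive string match stands in for whatever
    adjudication the study actually used.
    """
    top3 = [d.strip().lower() for d in ranked_diagnoses[:3]]
    return truth.strip().lower() in top3


def triage_exact(chosen: str, truth: str) -> bool:
    """True if the chosen triage category equals the ground truth."""
    return chosen == truth


def triage_dichotomised(chosen: str, truth: str) -> bool:
    """True if chosen and truth fall on the same side of the urgent split."""
    return (chosen in URGENT) == (truth in URGENT)


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between confidence (0..1) and outcome (0 or 1)."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("confidences and outcomes must be non-empty and equal length")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)
```

As a reference point, a constant confidence of 0.5 gives a Brier score of 0.25 whatever the outcomes are, and a perfectly confident, perfectly correct predictor scores 0. Lower is better.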

Findings

Among all cases, GPT-3 replied with the correct diagnosis in its top three for 88% (42/48, 95% CI 75–94) of cases, compared with 54% (2700/5000, 53–55) for lay individuals (p<0·0001) and 96% (637/666, 94–97) for physicians (p=0·012). GPT-3 triaged 70% of cases correctly (34/48, 57–82) versus 74% (3706/5000, 73–75; p=0·60) for lay individuals and 91% (608/666, 89–93; p<0·0001) for physicians. As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well calibrated for diagnosis (Brier score=0·18) and triage (Brier score=0·22). We observed an inverse relationship between case acuity and GPT-3 accuracy (p<0·0001), with a fitted trend line showing an 8·33% decrease in accuracy for every level of increase in case acuity. For triage error analysis, GPT-3 deprioritised truly emergent cases in seven instances.
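The abstract does not state how the 95% confidence intervals were computed. As a worked check, the sketch below computes a Wilson score interval for a binomial proportion; the choice of method and the function name `wilson_interval` are assumptions. With z = 1.96, this method reproduces the bounds reported for GPT-3 (42/48 and 34/48) after rounding to whole percentages.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: z = 1.96 approximates the two-sided 95% normal quantile.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    for label, k, n in [("GPT-3 diagnosis", 42, 48), ("GPT-3 triage", 34, 48)]:
        lo, hi = wilson_interval(k, n)
        print(f"{label}: {k}/{n} = {k / n:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

The width of these intervals (roughly 20 to 25 percentage points) reflects the small number of vignettes, which is worth keeping in mind when comparing GPT-3 with the much larger lay and physician samples.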

Interpretation

A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, those of physicians and better than those of lay individuals. We found that GPT-3's triage performance was inferior to that of physicians, sometimes by a large margin, and was closer to that of lay individuals. The diagnostic performance of GPT-3 was comparable to that of physicians and significantly better than that of a typical person using a search engine.

Funding

The National Heart, Lung, and Blood Institute.


Source journal

Lancet Digital Health Multiple-

CiteScore: 41.20

Self-citation rate: 1.60%

Articles published: 232

Review time: 13 weeks

About the journal: The Lancet Digital Health publishes important, innovative, and practice-changing research on any topic connected with digital technology in clinical medicine, public health, and global health. The journal's open access content crosses subject boundaries, building bridges between health professionals and researchers. By bringing together the most important advances in this multidisciplinary field, The Lancet Digital Health is the most prominent publishing venue in digital health. We publish a range of content types including Articles, Review, Comment, and Correspondence, contributing to promoting digital technologies in health practice worldwide.