Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors.

IF 1.3 Q4 HEALTH CARE SCIENCES & SERVICES
Yukinori Harada, Tomoharu Suzuki, Taku Harada, Tetsu Sakamoto, Kosuke Ishizuka, Taiju Miyagami, Ren Kawamura, Kotaro Kunitomo, Hiroyuki Nagano, Taro Shimizu, Takashi Watari
{"title":"Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors.","authors":"Yukinori Harada, Tomoharu Suzuki, Taku Harada, Tetsu Sakamoto, Kosuke Ishizuka, Taiju Miyagami, Ren Kawamura, Kotaro Kunitomo, Hiroyuki Nagano, Taro Shimizu, Takashi Watari","doi":"10.1136/bmjoq-2023-002654","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, this requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text based on suitable prompts. Therefore, ChatGPT can assist manual chart reviews in detecting diagnostic errors.</p><p><strong>Objective: </strong>This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations.</p><p><strong>Methods: </strong>We analysed 545 published case reports that included diagnostic errors. We imputed the texts of case presentations and the final diagnoses with some original prompts into ChatGPT (GPT-4) to generate responses, including the judgement of diagnostic errors and contributing factors of diagnostic errors. Factors contributing to diagnostic errors were coded according to the following three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). The responses on the contributing factors from ChatGPT were compared with those from physicians.</p><p><strong>Results: </strong>ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded statistically larger numbers of factors contributing to diagnostic errors per case than physicians: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The most important contributing factors of diagnostic errors coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC, and 'atypical presentation' (264, 48.4%) in GDP.</p><p><strong>Conclusion: </strong>ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual reviewing in detecting factors contributing to diagnostic errors, especially for 'atypical presentation'.</p>","PeriodicalId":9052,"journal":{"name":"BMJ Open Quality","volume":"13 2","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11149143/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMJ Open Quality","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1136/bmjoq-2023-002654","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Background: Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors, but it requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can classify text effectively when given suitable prompts, so it may be able to assist manual chart review in detecting diagnostic errors.

Objective: This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors, and identify possible contributing factors, from case presentations.

Methods: We analysed 545 published case reports that included diagnostic errors. We entered the texts of the case presentations and final diagnoses, together with original prompts, into ChatGPT (GPT-4) to generate responses comprising a judgement on whether a diagnostic error had occurred and the factors contributing to it. Contributing factors were coded according to three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). ChatGPT's coding of contributing factors was compared with that of physicians. A minimal sketch of this kind of prompting pipeline is shown below.
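
The abstract does not reproduce the study's actual prompts or model settings, so the sketch below is only an illustration of the kind of GPT-4 prompting pipeline the Methods describe; the prompt wording, the `review_case` helper and the temperature setting are assumptions, not the authors' implementation.

```python
# Minimal sketch of a case-review prompting pipeline, assuming the openai
# Python SDK (>=1.0) and an OPENAI_API_KEY in the environment. The prompt
# text below is illustrative, not the study's original prompt.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are reviewing a published case report for diagnostic error. "
    "State whether a diagnostic error occurred (yes/no), then list the "
    "contributing factors using the DEER, RDC and GDP taxonomies."
)

def review_case(case_presentation: str, final_diagnosis: str) -> str:
    """Send one case presentation plus its final diagnosis to GPT-4."""
    user_prompt = (
        f"Case presentation:\n{case_presentation}\n\n"
        f"Final diagnosis: {final_diagnosis}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,  # low-variance output for coding tasks (assumption)
    )
    return response.choices[0].message.content

# Example usage: review_case(case_text, "Pulmonary embolism")
```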

Results: ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more contributing factors per case than physicians: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The contributing factors most frequently coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC, and 'atypical presentation' (264, 48.4%) in GDP.
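
The abstract reports medians and p-values but does not name the statistical test used. As a hedged illustration only, a paired non-parametric comparison such as the Wilcoxon signed-rank test is one way per-case factor counts could yield results of this form; the arrays below are synthetic placeholders, not study data.

```python
# Sketch of a paired comparison of per-case factor counts (ChatGPT vs
# physicians). The test choice and the data are assumptions for illustration.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Placeholder per-case DEER factor counts for the same 545 cases.
chatgpt_counts = rng.poisson(lam=5, size=545)    # roughly median 5 per case
physician_counts = rng.poisson(lam=1, size=545)  # roughly median 1 per case

stat, p_value = wilcoxon(chatgpt_counts, physician_counts)
print(f"median ChatGPT: {np.median(chatgpt_counts):.0f}, "
      f"median physicians: {np.median(physician_counts):.0f}, p={p_value:.2e}")
```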

Conclusion: ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially 'atypical presentation'.

Source journal: BMJ Open Quality (Nursing - Leadership and Management)
CiteScore: 2.20 | Self-citation rate: 0.00% | Articles published: 226 | Review time: 20 weeks