Computerized diagnostic decision support systems - a comparative performance study of Isabel Pro vs. ChatGPT4.

IF 2.2 Q2 MEDICINE, GENERAL & INTERNAL
Diagnosis | Pub Date: 2024-05-07 | eCollection Date: 2024-08-01 | DOI: 10.1515/dx-2024-0033
Joe M Bridges
{"title":"Computerized diagnostic decision support systems - a comparative performance study of Isabel Pro vs. ChatGPT4.","authors":"Joe M Bridges","doi":"10.1515/dx-2024-0033","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>Validate the diagnostic accuracy of the Artificial Intelligence Large Language Model ChatGPT4 by comparing diagnosis lists produced by ChatGPT4 to Isabel Pro.</p><p><strong>Methods: </strong>This study used 201 cases, comparing ChatGPT4 to Isabel Pro. Systems inputs were identical. Mean Reciprocal Rank (MRR) compares the correct diagnosis's rank between systems. Isabel Pro ranks by the frequency with which the symptoms appear in the reference dataset. The mechanism ChatGPT4 uses to rank the diagnoses is unknown. A Wilcoxon Signed Rank Sum test failed to reject the null hypothesis.</p><p><strong>Results: </strong>Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 (87.1 %) correct diagnoses and ChatGPT4 165 (82.1 %). The MRR for ChatGPT4 was 0.428 (rank 2.31), and Isabel Pro was 0.389 (rank 2.57), an average rank of three for each. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, with Isabel Pro outperforming at 20, 30, and 40. The Wilcoxon Signed Rank Sum Test confirmed that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9 %) but only 52 correct DOIs (31.5 %).</p><p><strong>Conclusions: </strong>This study validates the promise of Clinical Diagnostic Decision Support Systems, including the Large Language Model form of artificial intelligence (AI). Until the issue of hallucination of references and, perhaps diagnoses, is resolved in favor of absolute accuracy, clinicians will make cautious use of Large Language Model systems in diagnosis, if at all.</p>","PeriodicalId":11273,"journal":{"name":"Diagnosis","volume":null,"pages":null},"PeriodicalIF":2.2000,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnosis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1515/dx-2024-0033","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Citations: 0

Abstract

Objectives: To validate the diagnostic accuracy of the artificial intelligence large language model ChatGPT4 by comparing the diagnosis lists it produces with those produced by Isabel Pro.

Methods: This study used 201 cases to compare ChatGPT4 with Isabel Pro. Inputs to the two systems were identical. Mean Reciprocal Rank (MRR) was used to compare the rank of the correct diagnosis between systems. Isabel Pro ranks diagnoses by the frequency with which the symptoms appear in its reference dataset; the mechanism ChatGPT4 uses to rank diagnoses is unknown. A Wilcoxon signed-rank test failed to reject the null hypothesis of no difference between the systems.
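
A minimal sketch of the two measures described in the methods, assuming per-case ranks of the correct diagnosis are available for both systems; the ranks below are illustrative, not the study's data:

```python
# Minimal sketch: Mean Reciprocal Rank and a paired Wilcoxon signed-rank test.
# Each entry is assumed to be the rank at which the correct diagnosis appeared
# for a case (0 = diagnosis not listed). Ranks are hypothetical, not study data.
import numpy as np
from scipy.stats import wilcoxon

def reciprocal_ranks(ranks):
    """Convert ranks to reciprocal ranks; a missed diagnosis (rank 0) scores 0."""
    return np.array([1.0 / r if r > 0 else 0.0 for r in ranks])

# Hypothetical paired ranks for the same ten cases on both systems.
chatgpt4_ranks = [1, 2, 1, 5, 0, 3, 1, 10, 2, 1]
isabel_ranks = [2, 1, 3, 4, 6, 1, 1, 8, 2, 5]

rr_chatgpt4 = reciprocal_ranks(chatgpt4_ranks)
rr_isabel = reciprocal_ranks(isabel_ranks)

print(f"MRR ChatGPT4:   {rr_chatgpt4.mean():.3f}")
print(f"MRR Isabel Pro: {rr_isabel.mean():.3f}")

# Paired Wilcoxon signed-rank test on the per-case reciprocal ranks;
# failing to reject the null means no detectable difference between systems.
stat, p = wilcoxon(rr_chatgpt4, rr_isabel)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.3f}")
```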

Results: Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced the correct diagnosis in 175 cases (87.1 %) and ChatGPT4 in 165 (82.1 %). The MRR for ChatGPT4 was 0.428 (mean rank 2.31) and for Isabel Pro 0.389 (mean rank 2.57), an average rank of roughly three for each. ChatGPT4 outperformed on recall at rank 1, 5, and 10, while Isabel Pro outperformed at rank 20, 30, and 40. The Wilcoxon signed-rank test indicated that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9 %) but only 52 correct DOIs (31.5 %).
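
Recall at rank k, as reported in the results, is the fraction of cases whose correct diagnosis appears within the top k positions of the differential list. A small illustrative sketch, again with hypothetical ranks rather than the study's data:

```python
# Recall at rank k: fraction of cases whose correct diagnosis appears within
# the top k positions of the differential list (rank 0 = not listed at all).
def recall_at_k(ranks, k):
    hits = sum(1 for r in ranks if 0 < r <= k)
    return hits / len(ranks)

# Hypothetical ranks of the correct diagnosis over ten cases.
example_ranks = [1, 2, 1, 5, 0, 3, 1, 10, 2, 1]

for k in (1, 5, 10, 20, 30, 40):
    print(f"Recall@{k}: {recall_at_k(example_ranks, k):.2f}")
```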

Conclusions: This study validates the promise of clinical diagnostic decision support systems, including the large language model form of artificial intelligence (AI). Until the issue of hallucinated references and, perhaps, diagnoses is resolved in favor of absolute accuracy, clinicians will make cautious use of large language model systems in diagnosis, if at all.

Source Journal: Diagnosis (MEDICINE, GENERAL & INTERNAL)
CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 41
Journal Introduction: Diagnosis focuses on how diagnosis can be advanced, how it is taught, and how and why it can fail, leading to diagnostic errors. The journal welcomes both fundamental and applied works, improvement initiatives, opinions, and debates to encourage new thinking on improving this critical aspect of healthcare quality.

Topics:
- Factors that promote diagnostic quality and safety
- Clinical reasoning
- Diagnostic errors in medicine
- The factors that contribute to diagnostic error: human factors, cognitive issues, and system-related breakdowns
- Improving the value of diagnosis – eliminating waste and unnecessary testing
- How culture and removing blame promote awareness of diagnostic errors
- Training and education related to clinical reasoning and diagnostic skills
- Advances in laboratory testing and imaging that improve diagnostic capability
- Local, national and international initiatives to reduce diagnostic error