Comparing Diagnostic Accuracy of ChatGPT to Clinical Diagnosis in General Surgery Consults: A Quantitative Analysis of Disease Diagnosis.

IF 1.1 · CAS Zone 4 (Medicine) · JCR Q2, MEDICINE, GENERAL & INTERNAL
Heather Meier, Ryan McMahon, Brittany Hout, Joshua Randles, James Aden, Julie A Rizzo
{"title":"Comparing Diagnostic Accuracy of ChatGPT to Clinical Diagnosis in General Surgery Consults: A Quantitative Analysis of Disease Diagnosis.","authors":"Heather Meier, Ryan McMahon, Brittany Hout, Joshua Randles, James Aden, Julie A Rizzo","doi":"10.1093/milmed/usaf168","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>This study addressed the challenge of providing accurate and timely medical diagnostics in military health care settings with limited access to advanced diagnostic tools, such as those encountered in austere environments, remote locations, or during large-scale combat operations. The primary objective was to evaluate the utility of ChatGPT, an artificial intelligence (AI) language model, as a support tool for health care providers in clinical decision-making and early diagnosis.</p><p><strong>Materials and methods: </strong>The research used an observational cross-sectional cohort design and exploratory predictive techniques. The methodology involved collecting and analyzing data from clinical scenarios based on common general surgery diagnoses-acute appendicitis, acute cholecystitis, and diverticulitis. These scenarios incorporated age, gender, symptoms, vital signs, physical exam findings, laboratory values, medical and surgical histories, and current medication regimens as data inputs. All collected data were entered into a table for each diagnosis. These tables were then used for scenario creation, with scenarios written to reflect typical patient presentations for each diagnosis. Finally, each scenario was entered into ChatGPT (version 3.5) individually, with ChatGPT then being asked to provide the leading diagnosis for the condition based on the provided information. The output from ChatGPT was then compared to the expected diagnosis to assess the accuracy.</p><p><strong>Results: </strong>A statistically significant difference between ChatGPT's diagnostic outcomes and clinical diagnoses for acute cholecystitis and diverticulitis was observed, with ChatGPT demonstrating inferior accuracy in controlled test scenarios. A secondary outcome analysis looked at the relationship between specific symptoms and diagnosis. The presence of these symptoms in incorrect diagnoses indicates that they may adversely impact ChatGPT's diagnostic decision-making, resulting in a higher likelihood of misdiagnosis. These results highlight AI's potential as a diagnostic support tool but underscore the importance of continued research to evaluate its performance in more complex and varied clinical scenarios.</p><p><strong>Conclusions: </strong>In summary, this study evaluated the diagnostic accuracy of ChatGPT in identifying three common surgical conditions (acute appendicitis, acute cholecystitis, and diverticulitis) using comprehensive patient data, including age, gender, medical history, medications, symptoms, vital signs, physical exam findings, and basic laboratory results. The hypothesis was that ChatGPT might display slightly lower accuracy rates than clinical diagnoses made by medical providers. The statistical analysis, which included Fisher's exact test, revealed a significant difference between ChatGPT's diagnostic outcomes and clinical diagnoses, particularly in acute cholecystitis and diverticulitis cases. Therefore, we reject the null hypothesis, as the results indicated that ChatGPT's diagnostic accuracy significantly differs from clinical diagnostics in the presented scenarios. 
However, ChatGPT's overall high accuracy suggests that it can reliably support clinicians, especially in environments where diagnostic resources are limited, and can serve as a valuable tool in military medicine.</p>","PeriodicalId":18638,"journal":{"name":"Military Medicine","volume":" ","pages":"e1858-e1862"},"PeriodicalIF":1.1000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Military Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1093/milmed/usaf168","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0

Abstract

Introduction: This study addressed the challenge of providing accurate and timely medical diagnostics in military health care settings with limited access to advanced diagnostic tools, such as those encountered in austere environments, remote locations, or during large-scale combat operations. The primary objective was to evaluate the utility of ChatGPT, an artificial intelligence (AI) language model, as a support tool for health care providers in clinical decision-making and early diagnosis.

Materials and methods: The research used an observational cross-sectional cohort design and exploratory predictive techniques. The methodology involved collecting and analyzing data from clinical scenarios based on common general surgery diagnoses: acute appendicitis, acute cholecystitis, and diverticulitis. These scenarios incorporated age, gender, symptoms, vital signs, physical exam findings, laboratory values, medical and surgical histories, and current medication regimens as data inputs. All collected data were entered into a table for each diagnosis. These tables were then used for scenario creation, with scenarios written to reflect typical patient presentations for each diagnosis. Finally, each scenario was entered into ChatGPT (version 3.5) individually, and ChatGPT was asked to provide the leading diagnosis based on the provided information. ChatGPT's output was then compared to the expected diagnosis to assess accuracy.
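
The study entered each scenario into the ChatGPT interface by hand, but the same evaluation loop is straightforward to script. The sketch below is a minimal, hypothetical reconstruction of that pipeline using the OpenAI Python client and the gpt-3.5-turbo model; the vignette texts and expected labels are illustrative placeholders, not the study's actual scenarios.

```python
# Minimal sketch of the evaluation loop described above. Assumption:
# the study worked through the ChatGPT web interface; this automates
# the same steps via the OpenAI Python client (openai >= 1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical scenarios: (vignette text, expected diagnosis).
scenarios = [
    ("22-year-old male, 24 h of periumbilical pain migrating to the "
     "right lower quadrant, fever 38.2 C, WBC 14.5, guarding at "
     "McBurney's point.",
     "acute appendicitis"),
    ("45-year-old female, RUQ pain after fatty meals, positive "
     "Murphy's sign, fever 38.0 C, WBC 13.1, normal bilirubin.",
     "acute cholecystitis"),
]

correct = 0
for vignette, expected in scenarios:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"{vignette}\n\nWhat is the single leading diagnosis?",
        }],
    )
    answer = resp.choices[0].message.content.lower()
    correct += expected in answer  # crude string match on the label

print(f"accuracy: {correct}/{len(scenarios)}")
```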

Results: A statistically significant difference between ChatGPT's diagnostic outcomes and clinical diagnoses was observed for acute cholecystitis and diverticulitis, with ChatGPT demonstrating inferior accuracy in controlled test scenarios. A secondary outcome analysis examined the relationship between specific symptoms and diagnosis: the presence of certain symptoms in incorrectly diagnosed scenarios suggests that they may adversely influence ChatGPT's diagnostic decision-making, resulting in a higher likelihood of misdiagnosis. These results highlight AI's potential as a diagnostic support tool but underscore the importance of continued research to evaluate its performance in more complex and varied clinical scenarios.
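
The abstract does not specify how the symptom analysis was run; one straightforward way to reproduce the idea is to tally which symptoms co-occur with incorrect outputs, as in this hypothetical sketch (the records are invented for illustration).

```python
# Hypothetical tally of symptoms that co-occur with misdiagnoses.
from collections import Counter

# Each record: (set of symptoms in the scenario, was ChatGPT correct?).
results = [
    ({"fever", "RLQ pain", "anorexia"}, True),
    ({"fever", "LLQ pain", "nausea"}, False),
    ({"nausea", "RUQ pain"}, False),
]

wrong = Counter()
for symptoms, is_correct in results:
    if not is_correct:
        wrong.update(symptoms)

# Symptoms most often present when the model was wrong.
print(wrong.most_common())
```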

Conclusions: In summary, this study evaluated the diagnostic accuracy of ChatGPT in identifying three common surgical conditions (acute appendicitis, acute cholecystitis, and diverticulitis) using comprehensive patient data, including age, gender, medical history, medications, symptoms, vital signs, physical exam findings, and basic laboratory results. The hypothesis was that ChatGPT would display slightly lower accuracy rates than clinical diagnoses made by medical providers. The statistical analysis, which included Fisher's exact test, revealed a significant difference between ChatGPT's diagnostic outcomes and clinical diagnoses, particularly for acute cholecystitis and diverticulitis. We therefore reject the null hypothesis, as the results indicated that ChatGPT's diagnostic accuracy differs significantly from clinical diagnosis in the presented scenarios. However, ChatGPT's overall high accuracy suggests that it can reliably support clinicians, especially in environments where diagnostic resources are limited, and can serve as a valuable tool in military medicine.
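
For readers unfamiliar with the test, Fisher's exact test compares two proportions via a 2x2 contingency table, which suits the small per-diagnosis sample sizes used here. A minimal sketch with scipy follows; the counts are invented for illustration and are not taken from the paper.

```python
# Fisher's exact test on a 2x2 table of correct/incorrect diagnoses.
# The counts below are hypothetical, not the study's data.
from scipy.stats import fisher_exact

#                correct  incorrect
table = [[15, 5],   # ChatGPT
         [19, 1]]   # clinical diagnosis

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.4f}")
```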

Source journal

Military Medicine (MEDICINE, GENERAL & INTERNAL)
CiteScore: 2.20 · Self-citation rate: 8.30% · Articles per year: 393 · Review time: 4-8 weeks
About the journal: Military Medicine is the official international journal of AMSUS. Articles published in the journal are peer-reviewed scientific papers, case reports, and editorials. The journal also publishes letters to the editor. The objective of the journal is to promote awareness of federal medicine by providing a forum for responsible discussion of common ideas and problems relevant to federal healthcare. Its mission is: to increase healthcare education by providing scientific and other information to its readers; to facilitate communication; and to offer a prestige publication for members’ writings.