Diagnostic Accuracy of a Large Language Model (ChatGPT-4) for Patients Admitted to a Community Hospital Medical Intensive Care Unit: A Retrospective Case Study.

IF 2.1 | CAS Tier 3 (Medicine) | JCR Q2, Critical Care Medicine
Jassimran Singh, Rhea Bohra, Vaibhavi Mukhtiar, Warren Fernandes, Charmi Bhanushali, Rajaeaswaran Chinnamuthu, Shihla Shireen Kanamgode, June Ellis, Eric Silverman
{"title":"Diagnostic Accuracy of a Large Language Model (ChatGPT-4) for Patients Admitted to a Community Hospital Medical Intensive Care Unit: A Retrospective Case Study.","authors":"Jassimran Singh, Rhea Bohra, Vaibhavi Mukhtiar, Warren Fernandes, Charmi Bhanushali, Rajaeaswaran Chinnamuthu, Shihla Shireen Kanamgode, June Ellis, Eric Silverman","doi":"10.1177/08850666251368270","DOIUrl":null,"url":null,"abstract":"<p><p>BackgroundThe future of artificial intelligence in medicine includes the use of machine learning and large language models to improve diagnostic accuracy, as a point-of-care tool, at the time of admission to an acute care hospital. The large language model, ChatGPT-4, has been shown to diagnose complex medical conditions with accuracies comparable to experienced clinicians, however, most published studies involved curated cases or examination-like questions and are not point-of-care. To test the hypothesis that ChatGPT-4 can make an accurate medical diagnosis using real-world medical cases and a convenient cut and paste strategy, we performed a retrospective case study involving critically ill patients admitted to a community hospital medical intensive care unit.MethodsA redacted H&P was essentially cut and pasted into ChatGPT-4 with uniform instructions to make a leading diagnosis and a list of 5 possibilities as a differential diagnosis. All features that could be used to identify patients were removed to ensure privacy and HIPAA compliance. The ChatGPT-4 diagnoses were compared with critical care physician diagnoses using a blinded longitudinal chart review as the ground truth diagnosis.ResultsA total of 120 randomly selected cases were included in the study. The diagnostic accuracy was 88.3% for physicians and 85.0% for ChatGPT-4, with no significant difference by McNemar testing (p-value of 0.249). The agreement between physician diagnosis and ChatGPT-4 diagnosis was moderate, 0.57 (95% CI: 0.35-0.79), based on Cohen's kappa statistic.ConclusionThese results suggest that ChatGTP-4 achieved diagnostic accuracy comparable to board certified physicians in the context of critically ill patients admitted to a community medical intensive care unit. Furthermore, the agreement was only moderate, suggesting that there may be complementary ways of combining the diagnostic acumen of physicians and ChatGPT-4 to improve overall accuracy. A prospective study would be necessary to determine if ChatGPT-4 could improve patient outcomes as a point-of-care tool at the time of admission.</p>","PeriodicalId":16307,"journal":{"name":"Journal of Intensive Care Medicine","volume":" ","pages":"8850666251368270"},"PeriodicalIF":2.1000,"publicationDate":"2025-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Intensive Care Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/08850666251368270","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CRITICAL CARE MEDICINE","Score":null,"Total":0}
引用次数: 0

Abstract

Background: The future of artificial intelligence in medicine includes the use of machine learning and large language models to improve diagnostic accuracy as a point-of-care tool at the time of admission to an acute care hospital. The large language model ChatGPT-4 has been shown to diagnose complex medical conditions with accuracy comparable to that of experienced clinicians; however, most published studies involved curated cases or examination-style questions rather than point-of-care use. To test the hypothesis that ChatGPT-4 can make an accurate medical diagnosis from real-world medical cases using a convenient cut-and-paste strategy, we performed a retrospective case study of critically ill patients admitted to a community hospital medical intensive care unit.

Methods: A redacted history and physical (H&P) was cut and pasted into ChatGPT-4 with uniform instructions to provide a leading diagnosis and a differential diagnosis of five possibilities. All features that could be used to identify patients were removed to ensure privacy and HIPAA compliance. The ChatGPT-4 diagnoses were compared with critical care physician diagnoses, using a blinded longitudinal chart review as the ground-truth diagnosis.

Results: A total of 120 randomly selected cases were included in the study. Diagnostic accuracy was 88.3% for physicians and 85.0% for ChatGPT-4, with no significant difference by McNemar testing (p = 0.249). Agreement between physician and ChatGPT-4 diagnoses was moderate (Cohen's kappa 0.57; 95% CI: 0.35-0.79).

Conclusion: These results suggest that ChatGPT-4 achieved diagnostic accuracy comparable to that of board-certified physicians for critically ill patients admitted to a community medical intensive care unit. Moreover, because agreement was only moderate, there may be complementary ways of combining the diagnostic acumen of physicians and ChatGPT-4 to improve overall accuracy. A prospective study would be necessary to determine whether ChatGPT-4 could improve patient outcomes as a point-of-care tool at the time of admission.
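The study describes a cut-and-paste workflow in the ChatGPT-4 web interface rather than a programmatic one, and the abstract does not give the exact prompt. As a minimal sketch of how such a workflow could be scripted, the example below uses the OpenAI Python API; the model name, prompt wording, and the `diagnose` helper are assumptions for illustration, not the study's materials.

```python
# Hypothetical sketch of the study's prompting step via the OpenAI API.
# The paper pasted a redacted H&P into the ChatGPT-4 web interface; the
# prompt wording and "gpt-4" model name below are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INSTRUCTIONS = (
    "You are assisting with diagnosis at the time of ICU admission. "
    "Based on the history and physical below, state a single leading "
    "diagnosis, then list 5 possibilities as a differential diagnosis."
)

def diagnose(redacted_hp: str) -> str:
    """Send one de-identified H&P and return the model's diagnosis text."""
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the ChatGPT-4 product used in the study
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": redacted_hp},
        ],
    )
    return response.choices[0].message.content
```

Any real reproduction would need the same de-identification step the authors describe: stripping all HIPAA identifiers from the H&P before it leaves the hospital environment.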
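The two reported statistics are straightforward to compute from paired per-case outcomes. The abstract does not publish the underlying 2x2 counts; the discordant counts below (8 physician-only-correct vs. 4 ChatGPT-only-correct) are illustrative assumptions chosen to be consistent with the reported marginals (106/120 and 102/120 correct) and to approximately reproduce both the McNemar p-value and the kappa.

```python
# Sketch of the reported statistics on assumed per-case counts.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from sklearn.metrics import cohen_kappa_score

# Rows: physician correct / incorrect; columns: ChatGPT-4 correct / incorrect.
# 98 both correct, 8 physician-only, 4 ChatGPT-only, 10 both wrong (assumed).
table = np.array([[98, 8],
                  [4, 10]])

# McNemar test compares only the discordant cells (8 vs. 4).
result = mcnemar(table, exact=False, correction=False)
print(f"McNemar chi2 = {result.statistic:.3f}, p = {result.pvalue:.3f}")

# Cohen's kappa on the same paired binary outcomes, aligned case by case:
physician = [1] * 106 + [0] * 14
chatgpt = [1] * 98 + [0] * 8 + [1] * 4 + [0] * 10
print(f"kappa = {cohen_kappa_score(physician, chatgpt):.2f}")
```

With these assumed counts the script prints p ≈ 0.248 and kappa ≈ 0.57, matching the abstract's figures; other splits of the discordant pairs would give different values.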

Source Journal

Journal of Intensive Care Medicine
CiteScore: 7.60
Self-citation rate: 3.20%
Articles per year: 107
About the journal: Journal of Intensive Care Medicine (JIC) is a peer-reviewed bi-monthly journal offering medical and surgical clinicians in adult and pediatric intensive care state-of-the-art, broad-based analytic reviews and updates, original articles, reports of large clinical series, techniques and procedures, topic-specific electronic resources, book reviews, and editorials on all aspects of intensive/critical/coronary care.