Diagnostic Accuracy of a Large Language Model (ChatGPT-4) for Patients Admitted to a Community Hospital Medical Intensive Care Unit: A Retrospective Case Study.
Jassimran Singh, Rhea Bohra, Vaibhavi Mukhtiar, Warren Fernandes, Charmi Bhanushali, Rajaeaswaran Chinnamuthu, Shihla Shireen Kanamgode, June Ellis, Eric Silverman
{"title":"Diagnostic Accuracy of a Large Language Model (ChatGPT-4) for Patients Admitted to a Community Hospital Medical Intensive Care Unit: A Retrospective Case Study.","authors":"Jassimran Singh, Rhea Bohra, Vaibhavi Mukhtiar, Warren Fernandes, Charmi Bhanushali, Rajaeaswaran Chinnamuthu, Shihla Shireen Kanamgode, June Ellis, Eric Silverman","doi":"10.1177/08850666251368270","DOIUrl":null,"url":null,"abstract":"<p><p>BackgroundThe future of artificial intelligence in medicine includes the use of machine learning and large language models to improve diagnostic accuracy, as a point-of-care tool, at the time of admission to an acute care hospital. The large language model, ChatGPT-4, has been shown to diagnose complex medical conditions with accuracies comparable to experienced clinicians, however, most published studies involved curated cases or examination-like questions and are not point-of-care. To test the hypothesis that ChatGPT-4 can make an accurate medical diagnosis using real-world medical cases and a convenient cut and paste strategy, we performed a retrospective case study involving critically ill patients admitted to a community hospital medical intensive care unit.MethodsA redacted H&P was essentially cut and pasted into ChatGPT-4 with uniform instructions to make a leading diagnosis and a list of 5 possibilities as a differential diagnosis. All features that could be used to identify patients were removed to ensure privacy and HIPAA compliance. The ChatGPT-4 diagnoses were compared with critical care physician diagnoses using a blinded longitudinal chart review as the ground truth diagnosis.ResultsA total of 120 randomly selected cases were included in the study. The diagnostic accuracy was 88.3% for physicians and 85.0% for ChatGPT-4, with no significant difference by McNemar testing (p-value of 0.249). The agreement between physician diagnosis and ChatGPT-4 diagnosis was moderate, 0.57 (95% CI: 0.35-0.79), based on Cohen's kappa statistic.ConclusionThese results suggest that ChatGTP-4 achieved diagnostic accuracy comparable to board certified physicians in the context of critically ill patients admitted to a community medical intensive care unit. Furthermore, the agreement was only moderate, suggesting that there may be complementary ways of combining the diagnostic acumen of physicians and ChatGPT-4 to improve overall accuracy. A prospective study would be necessary to determine if ChatGPT-4 could improve patient outcomes as a point-of-care tool at the time of admission.</p>","PeriodicalId":16307,"journal":{"name":"Journal of Intensive Care Medicine","volume":" ","pages":"8850666251368270"},"PeriodicalIF":2.1000,"publicationDate":"2025-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Intensive Care Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/08850666251368270","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CRITICAL CARE MEDICINE","Score":null,"Total":0}
Citations: 0
Abstract
Background: The future of artificial intelligence in medicine includes the use of machine learning and large language models to improve diagnostic accuracy as point-of-care tools at the time of admission to an acute care hospital. The large language model ChatGPT-4 has been shown to diagnose complex medical conditions with accuracy comparable to that of experienced clinicians; however, most published studies involved curated cases or examination-style questions rather than point-of-care use. To test the hypothesis that ChatGPT-4 can make an accurate medical diagnosis from real-world medical cases using a convenient cut-and-paste strategy, we performed a retrospective case study of critically ill patients admitted to a community hospital medical intensive care unit.

Methods: A redacted history and physical (H&P) was cut and pasted into ChatGPT-4 with uniform instructions to provide a leading diagnosis and a differential diagnosis of five possibilities. All features that could be used to identify patients were removed to ensure privacy and HIPAA compliance. ChatGPT-4 diagnoses were compared with critical care physician diagnoses, using a blinded longitudinal chart review as the ground-truth diagnosis.

Results: A total of 120 randomly selected cases were included in the study. Diagnostic accuracy was 88.3% for physicians and 85.0% for ChatGPT-4, with no significant difference by McNemar's test (p = 0.249). Agreement between physician and ChatGPT-4 diagnoses was moderate (Cohen's kappa 0.57; 95% CI: 0.35-0.79).

Conclusion: These results suggest that ChatGPT-4 achieved diagnostic accuracy comparable to that of board-certified physicians for critically ill patients admitted to a community hospital medical intensive care unit. Because the agreement was only moderate, there may be complementary ways of combining the diagnostic acumen of physicians and ChatGPT-4 to improve overall accuracy. A prospective study would be necessary to determine whether ChatGPT-4 can improve patient outcomes as a point-of-care tool at the time of admission.
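For context on the paired analysis reported above: McNemar's test compares two raters' accuracies on the same cases using only the discordant pairs, and Cohen's kappa adjusts raw agreement for agreement expected by chance. The Python sketch below is illustrative only; the 2x2 cell counts are hypothetical, chosen solely to be consistent with the reported summary figures (n = 120, physician accuracy 88.3%, ChatGPT-4 accuracy 85.0%), since the abstract does not report the discordant-pair counts, and the paper's kappa may be defined over the diagnoses themselves rather than over the binary correct/incorrect labels used here.

```python
# Minimal sketch of a paired diagnostic-accuracy analysis.
# NOTE: the cell counts below are HYPOTHETICAL, chosen only to match the
# reported marginals (106/120 physician-correct, 102/120 ChatGPT-4-correct);
# the actual per-case data are not given in the abstract.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of correctness per case:
# rows = physician (correct, incorrect); columns = ChatGPT-4 (correct, incorrect)
table = [[98, 8],   # physician correct:   GPT-4 correct / GPT-4 incorrect
         [4, 10]]   # physician incorrect: GPT-4 correct / GPT-4 incorrect

# McNemar's test uses only the two discordant cells (8 vs 4).
result = mcnemar(table, exact=False, correction=False)
print(f"McNemar chi2 = {result.statistic:.3f}, p = {result.pvalue:.3f}")

# Cohen's kappa from the same table: observed vs chance-expected agreement.
n = sum(sum(row) for row in table)
p_obs = (table[0][0] + table[1][1]) / n          # cases where both agree
p_phys = sum(table[0]) / n                       # physician "correct" marginal
p_gpt = (table[0][0] + table[1][0]) / n          # ChatGPT-4 "correct" marginal
p_exp = p_phys * p_gpt + (1 - p_phys) * (1 - p_gpt)
kappa = (p_obs - p_exp) / (1 - p_exp)
print(f"Cohen's kappa = {kappa:.2f}")
```

With these hypothetical counts the script prints a McNemar chi-square of 1.333 (p = 0.248) and a kappa of 0.57, in line with the abstract's figures, which illustrates why a non-significant McNemar test can coexist with only moderate inter-rater agreement: the test ignores the concordant cases entirely, while kappa is driven by them.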
Journal Information:
Journal of Intensive Care Medicine (JIC) is a peer-reviewed bi-monthly journal offering medical and surgical clinicians in adult and pediatric intensive care state-of-the-art, broad-based analytic reviews and updates, original articles, reports of large clinical series, techniques and procedures, topic-specific electronic resources, book reviews, and editorials on all aspects of intensive/critical/coronary care.