{"title":"Can AI match emergency physicians in managing common emergency cases? A comparative performance evaluation.","authors":"Mehmet Gün","doi":"10.1186/s12873-025-01303-y","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) such as ChatGPT are increasingly explored for clinical decision support. However, their performance in high-stakes emergency scenarios remains underexamined. This study aimed to evaluate ChatGPT's diagnostic and therapeutic accuracy compared to a board-certified emergency physician across diverse emergency cases.</p><p><strong>Methods: </strong>This comparative study was conducted using 15 standardized emergency scenarios sourced from validated academic platforms (Geeky Medics, Life in the Fast Lane, Emergency Medicine Cases). ChatGPT (GPT-4) and a physician independently evaluated each case based on five predefined parameters: diagnosis, investigations, initial treatment, clinical safety, and decision-making complexity. Cases were scored out of 5. Concordance was categorized as high (5/5), moderate (4/5), or low (≤ 3/5). Wilson confidence intervals (95%) were calculated for each concordance category.</p><p><strong>Results: </strong>ChatGPT achieved high concordance (5/5) in 8 cases (53.3%, 95% CI: 27.6-77.0%), moderate concordance (4/5) in 4 cases (26.7%, CI: 10.3-55.4%), and low concordance (≤ 3/5) in 3 cases (20.0%, CI: 6.0-45.6%). Performance was strongest in structured, protocol-based conditions such as STEMI, DKA, and asthma. Lower performance was observed in complex scenarios like stroke, trauma with shock, and mixed acid-base disturbances.</p><p><strong>Conclusion: </strong>ChatGPT showed strong alignment with emergency physician decisions in structured scenarios but lacked reliability in complex cases. While AI may enhance decision-making and education, it cannot replace the clinical reasoning of human physicians. Its role is best framed as a supportive tool rather than a substitute.</p>","PeriodicalId":9002,"journal":{"name":"BMC Emergency Medicine","volume":"25 1","pages":"142"},"PeriodicalIF":2.3000,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12315197/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Emergency Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12873-025-01303-y","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EMERGENCY MEDICINE","Score":null,"Total":0}
Citations: 0
Abstract
Background: Large language models (LLMs) such as ChatGPT are increasingly explored for clinical decision support. However, their performance in high-stakes emergency scenarios remains underexamined. This study aimed to evaluate ChatGPT's diagnostic and therapeutic accuracy compared with that of a board-certified emergency physician across diverse emergency cases.
Methods: This comparative study was conducted using 15 standardized emergency scenarios sourced from validated academic platforms (Geeky Medics, Life in the Fast Lane, Emergency Medicine Cases). ChatGPT (GPT-4) and a physician independently evaluated each case based on five predefined parameters: diagnosis, investigations, initial treatment, clinical safety, and decision-making complexity. Cases were scored out of 5. Concordance was categorized as high (5/5), moderate (4/5), or low (≤ 3/5). Wilson confidence intervals (95%) were calculated for each concordance category.
Results: ChatGPT achieved high concordance (5/5) in 8 cases (53.3%, 95% CI: 27.6-77.0%), moderate concordance (4/5) in 4 cases (26.7%, CI: 10.3-55.4%), and low concordance (≤ 3/5) in 3 cases (20.0%, CI: 6.0-45.6%). Performance was strongest in structured, protocol-based conditions such as STEMI, DKA, and asthma. Lower performance was observed in complex scenarios such as stroke, trauma with shock, and mixed acid-base disturbances.
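For readers interested in how the Wilson 95% confidence intervals above are obtained, the following minimal sketch (not part of the original study) assumes Python with statsmodels and recomputes intervals from the reported case counts; small discrepancies from the published figures may reflect rounding or a continuity-corrected variant of the interval.

```python
# Sketch only: Wilson 95% CIs for the reported concordance counts (8, 4, 3 of 15).
from statsmodels.stats.proportion import proportion_confint

counts = {"high (5/5)": 8, "moderate (4/5)": 4, "low (<= 3/5)": 3}
n_cases = 15  # total standardized emergency scenarios

for label, count in counts.items():
    # method="wilson" gives the score (Wilson) interval for a binomial proportion
    low, high = proportion_confint(count, n_cases, alpha=0.05, method="wilson")
    print(f"{label}: {count}/{n_cases} = {count / n_cases:.1%} "
          f"(95% CI {low:.1%}-{high:.1%})")
```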
Conclusion: ChatGPT showed strong alignment with emergency physician decisions in structured scenarios but lacked reliability in complex cases. While AI may enhance decision-making and education, it cannot replace the clinical reasoning of human physicians. Its role is best framed as a supportive tool rather than a substitute.
About the journal:
BMC Emergency Medicine is an open access, peer-reviewed journal that considers articles on all urgent and emergency aspects of medicine, in both practice and basic research. In addition, the journal covers aspects of disaster medicine, medicine in special locations such as conflict areas, and military medicine, together with articles concerning healthcare services in emergency departments.