The Accuracy of ChatGPT-4o in Interpreting Chest and Abdominal X-Ray Images
Pietro G Lacaita, Malik Galijasevic, Michael Swoboda, Leonhard Gruber, Yannick Scharll, Fabian Barbieri, Gerlig Widmann, Gudrun M Feuchtner
Journal of Personalized Medicine, vol. 15, no. 5, published 2025-05-10. DOI: 10.3390/jpm15050194
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12113413/pdf/
Citations: 0
Abstract
Background/Objectives: Large language models (LLMs), such as ChatGPT, have emerged as potential clinical support tools to enhance precision in personalized patient care, but their reliability in radiological image interpretation remains uncertain. The primary aim of our study was to evaluate the diagnostic accuracy of ChatGPT-4o in interpreting chest X-rays (CXRs) and abdominal X-rays (AXRs) by comparing its performance to expert radiology findings; secondary aims were to assess diagnostic confidence and patient safety. Methods: A total of 500 X-rays, comprising 257 CXRs (51.4%) and 243 AXRs (48.5%), were analyzed. Diagnoses made by ChatGPT-4o were compared to expert interpretations. Confidence scores (1-4) were assigned, and responses were evaluated for patient safety. Results: ChatGPT-4o correctly identified 345 of 500 (69%) pathologies (95% CI: 64.81-72.9). For AXRs, 175 of 243 (72.02%) pathologies were correctly diagnosed (95% CI: 66.06-77.28), while for CXRs, 170 of 257 (66.15%) were accurate (95% CI: 60.16-71.66). Among CXRs, the highest detection rates were observed for pulmonary edema, tumor, pneumonia, pleural effusion, cardiomegaly, and emphysema, with lower rates for pneumothorax, rib fractures, and enlarged mediastinum. AXR performance was highest for intestinal obstruction and foreign bodies, and weaker for pneumoperitoneum, renal calculi, and diverticulitis. Confidence scores were higher for AXRs (mean 3.45 ± 1.1) than for CXRs (mean 2.48 ± 1.45). All responses (100%) were considered safe for the patient. Interobserver agreement was high (kappa = 0.920), and reliability on a second prompt was moderate (kappa = 0.750). Conclusions: ChatGPT-4o demonstrated moderate accuracy in interpreting X-rays, with higher accuracy for AXRs than for CXRs. Improvements are required before it can serve as an efficient clinical support tool.
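The accuracy figures above are reported with 95% confidence intervals; the abstract does not name the interval method, but the ranges are consistent with a Wilson score interval for a binomial proportion. A minimal Python sketch, under that assumption, that reproduces the reported ranges from the published counts:

```python
# Sketch: recomputing the reported 95% CIs with a Wilson score interval.
# Assumption: the paper does not state the CI method in the abstract;
# the Wilson interval closely matches the published ranges.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Counts taken directly from the abstract.
for label, correct, total in [("Overall", 345, 500), ("AXR", 175, 243), ("CXR", 170, 257)]:
    lo, hi = wilson_ci(correct, total)
    print(f"{label}: {correct}/{total} = {correct/total:.2%} (95% CI {lo:.2%}-{hi:.2%})")
```

Running this yields approximately 64.8-72.9% overall, 66.1-77.3% for AXRs, and 60.2-71.7% for CXRs, matching the intervals reported in the abstract.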
About the Journal:
Journal of Personalized Medicine (JPM; ISSN 2075-4426) is an international, open-access journal aimed at bringing all aspects of personalized medicine to one platform. JPM publishes cutting-edge, innovative preclinical and translational scientific research and technologies related to personalized medicine (e.g., pharmacogenomics/proteomics, systems biology). JPM recognizes that personalized medicine (the assessment of genetic, environmental, and host factors that cause variability among individuals) is a challenging, transdisciplinary topic that requires discussions from a range of experts. For a comprehensive perspective of personalized medicine, JPM aims to integrate expertise from the molecular and translational sciences, therapeutics, and diagnostics, as well as discussions of regulatory, social, ethical, and policy aspects. We provide a forum to bring together academic and clinical researchers, biotechnology, diagnostic and pharmaceutical companies, health professionals, regulatory and ethical experts, and government and regulatory authorities.