The Treasure Trove Hidden in Plain Sight: The Utility of GPT-4 in Chest Radiograph Evaluation.

Impact factor 12.1; Zone 1 (Medicine); JCR Q1, RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Radiology Pub Date : 2024-11-01 DOI:10.1148/radiol.233441

Soroosh Tayebi Arasteh, Robert Siepmann, Marc Huppertz, Mahshad Lotfinia, Behrus Puladi, Christiane Kuhl, Daniel Truhn, Sven Nebelung

Citation: Radiology, vol. 313, no. 2, e233441 (published 2024-11-01). Publication type: Journal Article.


Abstract

Background: Limited statistical knowledge can slow critical engagement with and adoption of artificial intelligence (AI) tools for radiologists. Large language models (LLMs) such as OpenAI's GPT-4, and notably its Advanced Data Analysis (ADA) extension, may improve the adoption of AI in radiology.

Purpose: To validate GPT-4 ADA outputs when autonomously conducting analyses of varying complexity on a multisource clinical dataset.

Materials and Methods: In this retrospective study, unique itemized radiologic reports of bedside chest radiographs, associated demographic data, and laboratory markers of inflammation from patients in intensive care from January 2009 to December 2019 were evaluated. GPT-4 ADA, accessed between December 2023 and January 2024, was tasked with autonomously analyzing this dataset by plotting radiography usage rates, providing descriptive statistics measures, quantifying factors of pulmonary opacities, and setting up machine learning (ML) models to predict their presence. Three scientists with 6-10 years of ML experience validated the outputs by verifying the methodology, assessing coding quality, re-executing the provided code, and comparing ML models head-to-head with their human-developed counterparts (based on the area under the receiver operating characteristic curve [AUC], accuracy, sensitivity, and specificity). Statistical significance was evaluated using bootstrapping.

Results: A total of 43 788 radiograph reports, with their laboratory values, from University Hospital RWTH Aachen were evaluated from 43 788 patients (mean age, 66 years ± 15 [SD]; 26 804 male). While GPT-4 ADA provided largely appropriate visualizations, descriptive statistical measures, quantitative statistical associations based on logistic regression, and gradient boosting machines for the predictive task (AUC, 0.75), some statistical errors and inaccuracies were encountered. ML strategies were valid and based on consistent coding routines, resulting in valid outputs on par with human specialist-developed reference models (AUC, 0.80 [95% CI: 0.80, 0.81] vs 0.80 [95% CI: 0.80, 0.81]; P = .51) (accuracy, 79% [6910 of 8758 patients] vs 78% [6875 of 8758 patients], respectively; P = .27).

Conclusion: LLMs may facilitate data analysis in radiology, from basic statistics to advanced ML-based predictive modeling.

© RSNA, 2024. Supplemental material is available for this article.
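The abstract states only that statistical significance was evaluated using bootstrapping; it does not describe the procedure. As a rough illustration of how two models' AUCs might be compared on the same test set, the following is a minimal sketch of a paired percentile bootstrap. It is an assumption-laden example, not the authors' code: the function names `auc` and `paired_bootstrap_auc`, the resampling scheme, the two-sided P value definition, and the synthetic data are all hypothetical.

```python
import random


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) identity, with average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores occupy ranks i+1 .. j; each gets the average rank.
        avg_rank = (i + 1 + j) / 2
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def paired_bootstrap_auc(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Resample cases with replacement, keeping both models' scores paired.

    Returns (observed AUC difference, 95% percentile CI, two-sided P value).
    The P value definition used here is one common convention, assumed for
    illustration only.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[k] for k in idx]
        if sum(y) in (0, n):  # skip resamples that contain a single class
            continue
        a = [scores_a[k] for k in idx]
        b = [scores_b[k] for k in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
    frac_le = sum(d <= 0 for d in diffs) / n_boot
    frac_ge = sum(d >= 0 for d in diffs) / n_boot
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    observed = auc(labels, scores_a) - auc(labels, scores_b)
    return observed, ci, p_value


if __name__ == "__main__":
    # Synthetic stand-in data; the study's dataset is not reproduced here.
    rng = random.Random(42)
    y = [rng.randint(0, 1) for _ in range(200)]
    model_a = [label + rng.gauss(0, 1.0) for label in y]
    model_b = [label + rng.gauss(0, 1.0) for label in y]
    diff, ci, p = paired_bootstrap_auc(y, model_a, model_b, n_boot=200)
    print(f"AUC difference {diff:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f}), P = {p:.2f}")
```

Resampling case indices once per replicate and applying them to both score vectors keeps the comparison paired, which matters when both models are evaluated on the same patients.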

Source journal

Radiology (Medicine: Nuclear Medicine)

CiteScore: 35.20
Self-citation rate: 3.00%
Articles published: 596
Review time: 3.6 months

About the journal: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant, and highest-quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting-edge, impactful imaging research articles in radiology and medical imaging to help improve human health.