Assessing the reliability of ChatGPT4 in the appropriateness of radiology referrals

Marco Parillo, Federica Vaccarino, Daniele Vertulli, Gloria Perillo, Bruno Beomonte Zobel, Carlo Augusto Mallio
{"title":"Assessing the reliability of ChatGPT4 in the appropriateness of radiology referrals","authors":"Marco Parillo ,&nbsp;Federica Vaccarino ,&nbsp;Daniele Vertulli ,&nbsp;Gloria Perillo ,&nbsp;Bruno Beomonte Zobel ,&nbsp;Carlo Augusto Mallio","doi":"10.1016/j.rcro.2024.100155","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><p>To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).</p></div><div><h3>Method</h3><p>In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT where the requests were copied and pasted as inputs, getting as an output the RI-RADS score along with the evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of data assigned by the radiologist and ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).</p></div><div><h3>Results</h3><p>RI-RADS D was the most prevalent grade assigned by humans (54% of cases), while ChatGPT more frequently assigned the RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned the wrong RI-RADS grade, based on the ratings given to the subcategories. The distributions of the RI-RADS grade and the subcategories differed statistically significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. The reliability between the radiologist and ChatGPT in assigning RI-RADS score was very low (κ: 0.20), while the agreement between the two human readers was almost perfect (κ: 0.96).</p></div><div><h3>Conclusions</h3><p>ChatGPT may not be reliable for independently scoring the radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.</p></div>","PeriodicalId":101248,"journal":{"name":"The Royal College of Radiologists Open","volume":"2 ","pages":"Article 100155"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2773066224000068/pdfft?md5=a64e9eb96e6fe951627a494b801f534c&pid=1-s2.0-S2773066224000068-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Royal College of Radiologists Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2773066224000068","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Purpose

To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).

Method

In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT into which each request was copied and pasted as input, and which returned the RI-RADS score together with an evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of the grades assigned by the radiologist and by ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).
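As a rough illustration of the statistical analysis described above, the sketch below computes Pearson's chi-squared test on the two raters' grade distributions and Cohen's kappa for inter-rater agreement using SciPy and scikit-learn; the variable names and toy grade lists are assumptions for demonstration only and do not reflect the study's data or code.

```python
# Illustrative sketch of the agreement analysis; toy data and assumed
# variable names, not the authors' actual code or results.
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

GRADES = ["A", "B", "C", "D", "X"]  # RI-RADS grades

# Hypothetical grades, one per referral, in the same order for both raters
# (the study scored 450 referrals).
radiologist = ["D", "C", "D", "A", "X", "D", "C", "B"]
chatgpt     = ["C", "C", "D", "B", "X", "C", "C", "A"]

# Pearson's chi-squared test comparing the two raters' grade distributions.
contingency = [[radiologist.count(g) for g in GRADES],
               [chatgpt.count(g) for g in GRADES]]
chi2, p_value, dof, _ = chi2_contingency(contingency)

# Cohen's kappa for inter-rater reliability on the overall RI-RADS grade;
# the same call can be repeated for each of the three subcategories.
kappa = cohen_kappa_score(radiologist, chatgpt, labels=GRADES)

print(f"chi-squared = {chi2:.2f} (p = {p_value:.3f}), kappa = {kappa:.2f}")
```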

Results

RI-RADS D was the most prevalent grade assigned by the human readers (54% of cases), whereas ChatGPT most frequently assigned RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned a RI-RADS grade that was inconsistent with the ratings it gave to the subcategories. The distributions of the RI-RADS grade and its subcategories differed significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. The reliability between the radiologist and ChatGPT in assigning the RI-RADS score was very low (κ: 0.20), while the agreement between the two human readers was almost perfect (κ: 0.96).

Conclusions

ChatGPT may not be reliable for independently scoring radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.
