Marco Parillo, Federica Vaccarino, Daniele Vertulli, Gloria Perillo, Bruno Beomonte Zobel, Carlo Augusto Mallio
{"title":"评估 ChatGPT4 在放射科转诊适当性方面的可靠性","authors":"Marco Parillo , Federica Vaccarino , Daniele Vertulli , Gloria Perillo , Bruno Beomonte Zobel , Carlo Augusto Mallio","doi":"10.1016/j.rcro.2024.100155","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><p>To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).</p></div><div><h3>Method</h3><p>In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT where the requests were copied and pasted as inputs, getting as an output the RI-RADS score along with the evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of data assigned by the radiologist and ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).</p></div><div><h3>Results</h3><p>RI-RADS D was the most prevalent grade assigned by humans (54% of cases), while ChatGPT more frequently assigned the RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned the wrong RI-RADS grade, based on the ratings given to the subcategories. The distributions of the RI-RADS grade and the subcategories differed statistically significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. The reliability between the radiologist and ChatGPT in assigning RI-RADS score was very low (κ: 0.20), while the agreement between the two human readers was almost perfect (κ: 0.96).</p></div><div><h3>Conclusions</h3><p>ChatGPT may not be reliable for independently scoring the radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.</p></div>","PeriodicalId":101248,"journal":{"name":"The Royal College of Radiologists Open","volume":"2 ","pages":"Article 100155"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2773066224000068/pdfft?md5=a64e9eb96e6fe951627a494b801f534c&pid=1-s2.0-S2773066224000068-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Assessing the reliability of ChatGPT4 in the appropriateness of radiology referrals\",\"authors\":\"Marco Parillo , Federica Vaccarino , Daniele Vertulli , Gloria Perillo , Bruno Beomonte Zobel , Carlo Augusto Mallio\",\"doi\":\"10.1016/j.rcro.2024.100155\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Purpose</h3><p>To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).</p></div><div><h3>Method</h3><p>In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT where the requests were copied and pasted as inputs, getting as an output the RI-RADS score along with the evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of data assigned by the radiologist and ChatGPT differed significantly. 
Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).</p></div><div><h3>Results</h3><p>RI-RADS D was the most prevalent grade assigned by humans (54% of cases), while ChatGPT more frequently assigned the RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned the wrong RI-RADS grade, based on the ratings given to the subcategories. The distributions of the RI-RADS grade and the subcategories differed statistically significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. The reliability between the radiologist and ChatGPT in assigning RI-RADS score was very low (κ: 0.20), while the agreement between the two human readers was almost perfect (κ: 0.96).</p></div><div><h3>Conclusions</h3><p>ChatGPT may not be reliable for independently scoring the radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.</p></div>\",\"PeriodicalId\":101248,\"journal\":{\"name\":\"The Royal College of Radiologists Open\",\"volume\":\"2 \",\"pages\":\"Article 100155\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2773066224000068/pdfft?md5=a64e9eb96e6fe951627a494b801f534c&pid=1-s2.0-S2773066224000068-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The Royal College of Radiologists Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2773066224000068\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Royal College of Radiologists Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2773066224000068","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Assessing the reliability of ChatGPT4 in the appropriateness of radiology referrals
Purpose
To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).
Method
In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized GPT ("RI-RADS GPT") into which each request was pasted as input; the output was the overall RI-RADS grade together with an evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of grades assigned by the radiologist and by ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ). A minimal sketch of this agreement analysis is given below.
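The statistical comparison described above can be reproduced with standard Python libraries. The following is a minimal sketch, assuming the paired RI-RADS grades from the radiologist and from ChatGPT are available as two parallel lists; the example ratings, variable names, and grade labels shown here are illustrative and are not taken from the study data.

```python
# Illustrative sketch of the agreement analysis: Cohen's kappa for
# inter-rater reliability and Pearson's chi-squared test comparing
# the two grade distributions. The ratings below are hypothetical.
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

GRADES = ["A", "B", "C", "D", "X"]

# Hypothetical paired grades for a handful of referrals.
radiologist = ["D", "D", "C", "B", "D", "X", "C", "D"]
chatgpt = ["C", "D", "C", "C", "B", "X", "C", "D"]

# Inter-rater reliability for the overall RI-RADS grade.
kappa = cohen_kappa_score(radiologist, chatgpt, labels=GRADES)
print(f"Cohen's kappa: {kappa:.2f}")

# Pearson's chi-squared test on the grade frequency table.
counts = np.array([
    [Counter(radiologist)[g] for g in GRADES],
    [Counter(chatgpt)[g] for g in GRADES],
])
# Drop grades that neither rater assigned to avoid all-zero columns.
counts = counts[:, counts.sum(axis=0) > 0]
chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
```

The same kappa call can be applied separately to each of the three RI-RADS subcategories to obtain the per-subcategory reliability reported in the study.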
Results
RI-RADS D was the most prevalent grade assigned by the human readers (54% of cases), whereas ChatGPT most frequently assigned RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned an overall RI-RADS grade that was inconsistent with the ratings it gave to the subcategories. The distributions of the RI-RADS grades and subcategories differed significantly between the radiologist and ChatGPT, with the exception of RI-RADS grades C and X. Reliability between the radiologist and ChatGPT in assigning the RI-RADS score was very low (κ: 0.20), whereas agreement between the two human readers was almost perfect (κ: 0.96).
Conclusions
ChatGPT may not be reliable for independently scoring radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.