Assessing the Performance of ChatGPT and Bard/Gemini Against Radiologists for PI-RADS Classification Based on Prostate Multiparametric MRI Text Reports.
Abstract
Objectives: Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign PI-RADS categories based on clinical text reports.
Methods: One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by two uroradiologists, GPT-3.5, GPT-4, Bard, and Gemini. Original report classifications were considered definitive.
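The paper does not describe how the reports were submitted to the chatbots. As a purely illustrative sketch, the snippet below shows how a report could be passed to a GPT model through the OpenAI Python client with a prompt requesting an overall PI-RADS category; the prompt wording, model name, and `classify_report` helper are assumptions for illustration, not the study's actual method.

```python
# Hypothetical sketch of submitting a prostate MRI text report to an LLM
# for PI-RADS classification. The study does not specify its querying
# setup, so the prompt and workflow here are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are a uroradiologist. Based on the following multiparametric "
    "prostate MRI report, assign a single overall PI-RADS category "
    "(1, 2, 3, 4, or 5). Reply with the number only.\n\nReport:\n{report}"
)

def classify_report(report_text: str, model: str = "gpt-4") -> str:
    """Ask the model for an overall PI-RADS category for one report."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
        temperature=0,  # deterministic output aids reproducibility
    )
    return response.choices[0].message.content.strip()
```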
Results: Of 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 as PI-RADS 3, 19 as PI-RADS 4, and 20 as PI-RADS 5. The radiologists demonstrated 95% and 90% accuracy, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated LLM versions increased to 83% (GPT-4) and 79% (Gemini), respectively. In low-suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94 and 0.98, respectively) outperformed GPT-3.5 and GPT-4 (F1: 0.77 and 0.94, respectively), whereas for high-probability MRIs (PI-RADS 4-5), GPT-3.5 and GPT-4 (F1: 0.95 and 0.98, respectively) outperformed Bard and Gemini (F1: 0.71 and 0.87, respectively). Bard "hallucinated", assigning a non-existent PI-RADS 6 category for two patients. Inter-reader agreements (κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GPT-4, Bard, and Gemini were 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively.
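The accuracy, F1, and agreement statistics above are standard classification metrics. Assuming Cohen's κ and per-group binary F1 (the paper does not name its analysis software), they can be computed from paired category lists as in this minimal scikit-learn sketch; the labels below are made up, not the study's data.

```python
# Illustrative computation of the reported metric types with scikit-learn;
# the label lists are invented examples, not the study's data.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Ground truth: PI-RADS categories from the original reports.
original = [2, 2, 3, 4, 5, 1, 4, 5, 2, 3]
# Categories assigned by one reader (radiologist or LLM).
reader   = [2, 2, 2, 4, 5, 2, 4, 4, 2, 3]

print(accuracy_score(original, reader))     # overall accuracy
print(cohen_kappa_score(original, reader))  # inter-reader agreement (kappa)

# F1 for the low-suspicion group: treat PI-RADS 1-2 as the positive class.
low = lambda ys: [1 if y <= 2 else 0 for y in ys]
print(f1_score(low(original), low(reader)))
```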
Conclusions: Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GPT-4 and Gemini demonstrated improved performance compared to their predecessors.
Advances in knowledge: This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.
About the journal:
BJR is the international research journal of the British Institute of Radiology and is the oldest scientific journal in the field of radiology and related sciences.
Dating back to 1896, BJR's history is radiology's history, and the journal has featured landmark papers such as the first description of computed tomography, "Computerized transverse axial tomography", by Godfrey Hounsfield in 1973. A valuable historical resource, the complete BJR archive has been digitized back to 1896.
Quick Facts:
- 2015 Impact Factor – 1.840
- Receipt to first decision – average of 6 weeks
- Acceptance to online publication – average of 3 weeks
- ISSN: 0007-1285
- eISSN: 1748-880X
- Open Access option