Assessing the Performance of ChatGPT and Bard/Gemini Against Radiologists for PI-RADS Classification Based on Prostate Multiparametric MRI Text Reports.
Abstract
Objectives: Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign PI-RADS categories based on clinical text reports.
Methods: One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by two uroradiologists, GPT-3.5, GPT-4, Bard, and Gemini. Original report classifications were considered definitive.
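The paper does not describe how the reports were submitted to the chatbots. As a purely illustrative sketch, the snippet below shows how a report could be passed to a GPT model through the OpenAI Python client with a prompt requesting an overall PI-RADS category; the prompt wording, model name, and `classify_report` helper are assumptions for illustration, not the study's actual method.

```python
# Hypothetical sketch of submitting a prostate MRI text report to an LLM
# for PI-RADS classification. The study does not specify its querying
# setup, so the prompt and workflow here are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are a uroradiologist. Based on the following multiparametric "
    "prostate MRI report, assign a single overall PI-RADS category "
    "(1, 2, 3, 4, or 5). Reply with the number only.\n\nReport:\n{report}"
)

def classify_report(report_text: str, model: str = "gpt-4") -> str:
    """Ask the model for an overall PI-RADS category for one report."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
        temperature=0,  # deterministic output aids reproducibility
    )
    return response.choices[0].message.content.strip()
```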
Results: Of 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 as PI-RADS 3, 19 as PI-RADS 4, and 20 as PI-RADS 5. The radiologists demonstrated 95% and 90% accuracy, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated LLM versions increased to 83% (GPT-4) and 79% (Gemini), respectively. In low-suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94 and 0.98, respectively) outperformed GPT-3.5 and GPT-4 (F1: 0.77 and 0.94, respectively), whereas for high-probability MRIs (PI-RADS 4-5), GPT-3.5 and GPT-4 (F1: 0.95 and 0.98, respectively) outperformed Bard and Gemini (F1: 0.71 and 0.87, respectively). Bard "hallucinated", assigning a non-existent PI-RADS 6 category for two patients. Inter-reader agreements (κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GPT-4, Bard, and Gemini were 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively.
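The accuracy, F1, and agreement statistics above are standard classification metrics. Assuming Cohen's κ and per-group binary F1 (the paper does not name its analysis software), they can be computed from paired category lists as in this minimal scikit-learn sketch; the labels below are made up, not the study's data.

```python
# Illustrative computation of the reported metric types with scikit-learn;
# the label lists are invented examples, not the study's data.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Ground truth: PI-RADS categories from the original reports.
original = [2, 2, 3, 4, 5, 1, 4, 5, 2, 3]
# Categories assigned by one reader (radiologist or LLM).
reader   = [2, 2, 2, 4, 5, 2, 4, 4, 2, 3]

print(accuracy_score(original, reader))     # overall accuracy
print(cohen_kappa_score(original, reader))  # inter-reader agreement (kappa)

# F1 for the low-suspicion group: treat PI-RADS 1-2 as the positive class.
low = lambda ys: [1 if y <= 2 else 0 for y in ys]
print(f1_score(low(original), low(reader)))
```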
Conclusions: Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GPT-4 and Gemini demonstrated improved performance compared to their predecessors.
Advances in knowledge: This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.
About the journal:
BJR is the international research journal of the British Institute of Radiology and is the oldest scientific journal in the field of radiology and related sciences.
Dating back to 1896, BJR's history is radiology's history, and the journal has featured landmark papers such as the first description of computed tomography, "Computerized transverse axial tomography", by Godfrey Hounsfield in 1973. A valuable historical resource, the complete BJR archive has been digitized back to 1896.
Quick Facts:
- 2015 Impact Factor – 1.840
- Receipt to first decision – average of 6 weeks
- Acceptance to online publication – average of 3 weeks
- ISSN: 0007-1285
- eISSN: 1748-880X
- Open Access option