Assessing the Performance of ChatGPT and Bard/Gemini Against Radiologists for PI-RADS Classification Based on Prostate Multiparametric MRI Text Reports.

IF 1.8 | CAS Tier 4 (Medicine) | JCR Q3: RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Kang-Lung Lee, Dimitri A Kessler, Iztok Caglic, Yi-Hsin Kuo, Nadeem Shaida, Tristan Barrett
{"title":"Assessing the Performance of ChatGPT and Bard/Gemini Against Radiologists for PI-RADS Classification Based on Prostate Multiparametric MRI Text Reports.","authors":"Kang-Lung Lee, Dimitri A Kessler, Iztok Caglic, Yi-Hsin Kuo, Nadeem Shaida, Tristan Barrett","doi":"10.1093/bjr/tqae236","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign PI-RADS categories based on clinical text reports.</p><p><strong>Methods: </strong>One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by two uroradiologists, GPT-3.5, GPT-4, Bard, and Gemini. Original report classifications were considered definitive.</p><p><strong>Results: </strong>Out of 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 PI-RADS 3, 19 PI-RADS 4, and 20 PI-RADS 5. Radiologists demonstrated 95% and 90% accuracy, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated versions of LLMs increased to 83% (GTP-4) and 79% (Gemini), respectively. In low suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94, 0.98, respectively) outperformed GPT-3.5 and GTP-4 (F1:0.77, 0.94, respectively), whereas for high probability MRIs (PI-RADS 4-5), GPT-3.5 and GTP-4 (F1: 0.95, 0.98, respectively) outperformed Bard and Gemini (F1: 0.71, 0.87, respectively). Bard assigned a non-existent PI-RADS 6 \"hallucination\" for two patients. Inter-reader agreements (Κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GTP-4, BARD, and Gemini were 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively.</p><p><strong>Conclusions: </strong>Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GTP-4 and Gemini demonstrated improved performance compared to their predecessors.</p><p><strong>Advances in knowledge: </strong>This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.</p>","PeriodicalId":9306,"journal":{"name":"British Journal of Radiology","volume":" ","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1093/bjr/tqae236","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0

Abstract

Objectives: Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign PI-RADS categories based on clinical text reports.

Methods: One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by two uroradiologists, GPT-3.5, GPT-4, Bard, and Gemini. Original report classifications were considered definitive.
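
The abstract does not publish the prompts used, but the workflow it describes (submitting a report's text to a chat model and asking it to assign a PI-RADS category) can be sketched as follows. This is a minimal, hypothetical illustration using the OpenAI Python client; the model name, prompt wording, and classify_report helper are assumptions for illustration, not the authors' actual protocol.

```python
# Hypothetical sketch of the report-classification workflow described in
# Methods: send an MRI report's text to a chat model and parse the PI-RADS
# category it returns. Prompt wording and helper names are illustrative
# assumptions, not the study's protocol.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a uroradiologist. Based solely on the following prostate "
    "multiparametric MRI report, assign a single overall PI-RADS category "
    "(1, 2, 3, 4, or 5). Reply with the number only.\n\n{report}"
)

def classify_report(report_text: str, model: str = "gpt-4") -> int | None:
    """Return the PI-RADS category the model assigns, or None if unparseable."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
    )
    match = re.search(r"\b([1-5])\b", response.choices[0].message.content)
    return int(match.group(1)) if match else None
```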

Results: Out of 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 as PI-RADS 3, 19 as PI-RADS 4, and 20 as PI-RADS 5. The radiologists demonstrated 95% and 90% accuracy, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated LLM versions increased to 83% (GPT-4) and 79% (Gemini), respectively. In low-suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94 and 0.98, respectively) outperformed GPT-3.5 and GPT-4 (F1: 0.77 and 0.94, respectively), whereas for high-probability MRIs (PI-RADS 4-5), GPT-3.5 and GPT-4 (F1: 0.95 and 0.98, respectively) outperformed Bard and Gemini (F1: 0.71 and 0.87, respectively). Bard "hallucinated", assigning a non-existent PI-RADS 6 category to two patients. Inter-reader agreement (κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GPT-4, Bard, and Gemini was 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively.
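
The metrics reported above (accuracy, per-group F1, and Cohen's kappa against the original report) are standard classification measures. Below is a minimal sketch of how they can be computed with scikit-learn; the label arrays are invented placeholders rather than the study's data, and binarizing each suspicion group for F1 is one plausible reading, since the abstract does not specify the exact grouping.

```python
# Minimal sketch of the evaluation metrics reported in Results, computed
# with scikit-learn. The label arrays are invented placeholders, not the
# study's data.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

original = [1, 2, 3, 4, 5, 4, 2, 1]   # PI-RADS from the original reports (reference standard)
predicted = [1, 2, 3, 4, 5, 3, 2, 2]  # PI-RADS assigned by a reader or an LLM

print("Accuracy:", accuracy_score(original, predicted))

# Agreement against the original report (the kappa values in Results)
print("Cohen's kappa:", cohen_kappa_score(original, predicted))

# F1 for the low-suspicion group: treat PI-RADS 1-2 as the positive class
low_ref = [1 if y <= 2 else 0 for y in original]
low_pred = [1 if y <= 2 else 0 for y in predicted]
print("F1 (PI-RADS 1-2):", f1_score(low_ref, low_pred))

# F1 for the high-probability group: treat PI-RADS 4-5 as the positive class
high_ref = [1 if y >= 4 else 0 for y in original]
high_pred = [1 if y >= 4 else 0 for y in predicted]
print("F1 (PI-RADS 4-5):", f1_score(high_ref, high_pred))
```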

Conclusions: Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GPT-4 and Gemini demonstrated improved performance compared with their predecessors.

Advances in knowledge: This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.

Source Journal
British Journal of Radiology (Medicine – Nuclear Medicine)
CiteScore: 5.30
Self-citation rate: 3.80%
Articles per year: 330
Review turnaround: 2-4 weeks
Journal description: BJR is the international research journal of the British Institute of Radiology and is the oldest scientific journal in the field of radiology and related sciences. Dating back to 1896, BJR's history is radiology's history, and the journal has featured landmark papers such as the first description of computed tomography, "Computerized transverse axial tomography" by Godfrey Hounsfield in 1973. A valuable historical resource, the complete BJR archive has been digitized from 1896.
Quick Facts:
- 2015 Impact Factor: 1.840
- Receipt to first decision: average of 6 weeks
- Acceptance to online publication: average of 3 weeks
- ISSN: 0007-1285
- eISSN: 1748-880X
- Open Access option