Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.

IF 1.4 · Zone 4, Medicine · JCR Q3 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Diagnostic and interventional radiology Pub Date : 2025-03-03 Epub Date: 2024-09-09 DOI:10.4274/dir.2024.242876

Yasin Celal Güneş, Turay Cesur, Eren Çamur, Leman Günbey Karabekmez

Pages: 111-129 · Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11880873/pdf/

Citations: 0

Abstract

Purpose: This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions.

Methods: This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. The comparison of correct answers and accuracy by question types was analyzed using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests.
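
As an illustration of the paired comparison named above, the following is a minimal sketch of an exact (binomial) McNemar's test on two responders' per-question correctness. The abstract does not state whether the exact or the chi-squared-approximated form was used, so the exact form is an assumption here; the function name `mcnemar_exact` and the answer lists are hypothetical and do not reproduce the study's data.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar's test for paired binary outcomes.

    correct_a / correct_b: sequences of booleans, one entry per question,
    True when that responder answered the question correctly.
    Returns (only_a, only_b, p_value), where only_a counts questions that
    A got right and B got wrong, and only_b counts the reverse.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both responders must answer the same questions")

    # Only discordant pairs carry information about a difference.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0

    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical toy data (16 questions), NOT taken from the study.
model_a = [True] * 8 + [False] * 1 + [True] * 5 + [False] * 2
model_b = [False] * 8 + [True] * 1 + [True] * 5 + [False] * 2

only_a, only_b, p_value = mcnemar_exact(model_a, model_b)
print(f"discordant pairs: {only_a} vs {only_b}, p = {p_value:.4f}")
```

The test ignores questions that both responders answered correctly or both answered incorrectly, which is why two models with similar overall accuracy on the same 100 questions can still differ or not differ significantly depending on how their errors overlap.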

Results: Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different question categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories for every model except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05).

Conclusion: Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses.

Clinical significance: This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.

Source journal

Diagnostic and interventional radiology Medicine-Radiology, Nuclear Medicine and Imaging

Self-citation rate

4.80%

About the journal: Diagnostic and Interventional Radiology (Diagn Interv Radiol) is the open access, online-only official publication of the Turkish Society of Radiology. It is published bimonthly, and the journal's publication language is English. The journal is a medium for original articles, reviews, pictorial essays, and technical notes related to all fields of diagnostic and interventional radiology.