{"title":"Evaluating the reference accuracy of large language models in radiology: a comparative study across subspecialties.","authors":"Yasin Celal Güneş, Turay Cesur, Eren Çamur","doi":"10.4274/dir.2025.253101","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to compare six large language models (LLMs) [Chat Generative Pre-trained Transformer (ChatGPT)o1-preview, ChatGPT-4o, ChatGPT-4o with canvas, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, and Claude 3 Opus] in generating radiology references, assessing accuracy, fabrication, and bibliographic completeness.</p><p><strong>Methods: </strong>In this cross-sectional observational study, 120 open-ended questions were administered across eight radiology subspecialties (neuroradiology, abdominal, musculoskeletal, thoracic, pediatric, cardiac, head and neck, and interventional radiology), with 15 questions per subspecialty. Each question prompted the LLMs to provide responses containing four references with in-text citations and complete bibliographic details (authors, title, journal, publication year/month, volume, issue, page numbers, and PubMed Identifier). References were verified using Medline, Google Scholar, the Directory of Open Access Journals, and web searches. Each bibliographic element was scored for correctness, and a composite final score [(FS): 0-36] was calculated by summing the correct elements and multiplying this by a 5-point verification score for content relevance. The FS values were then categorized into a 5-point Likert scale reference accuracy score (RAS: 0 = fabricated; 4 = fully accurate). Non-parametric tests (Kruskal-Wallis, Tamhane's T2, Wilcoxon signed-rank test with Bonferroni correction) were used for statistical comparisons.</p><p><strong>Results: </strong>Claude 3.5 Sonnet demonstrated the highest reference accuracy, with 80.8% fully accurate references (RAS 4) and a fabrication rate of 3.1%, significantly outperforming all other models (<i>P</i> < 0.001). Claude 3 Opus ranked second, achieving 59.6% fully accurate references and a fabrication rate of 18.3% (<i>P</i> < 0.001). ChatGPT-based models (ChatGPT-4o, ChatGPT-4o with canvas, and ChatGPT o1-preview) exhibited moderate accuracy, with fabrication rates ranging from 27.7% to 52.9% and <8% fully accurate references. Google Gemini 1.5 Pro had the lowest performance, achieving only 2.7% fully accurate references and the highest fabrication rate of 60.6% (<i>P</i> < 0.001). Reference accuracy also varied by subspecialty, with neuroradiology and cardiac radiology outperforming pediatric and head and neck radiology.</p><p><strong>Conclusion: </strong>Claude 3.5 Sonnet significantly outperformed all other models in generating verifiable radiology references, and Claude 3 Opus showed moderate performance. In contrast, ChatGPT models and Google Gemini 1.5 Pro delivered substantially lower accuracy with higher rates of fabricated references, highlighting current limitations in automated academic citation generation.</p><p><strong>Clinical significance: </strong>The high accuracy of Claude 3.5 Sonnet can improve radiology literature reviews, research, and education with dependable references. 
The poor performance of other models, with high fabrication rates, risks misinformation in clinical and academic settings and highlights the need for refinement to ensure safe and effective use.</p>","PeriodicalId":11341,"journal":{"name":"Diagnostic and interventional radiology","volume":" ","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnostic and interventional radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.4274/dir.2025.253101","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Abstract
Purpose: This study aimed to compare six large language models (LLMs) [Chat Generative Pre-trained Transformer (ChatGPT) o1-preview, ChatGPT-4o, ChatGPT-4o with canvas, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, and Claude 3 Opus] in generating radiology references, assessing accuracy, fabrication, and bibliographic completeness.
Methods: In this cross-sectional observational study, 120 open-ended questions were administered across eight radiology subspecialties (neuroradiology, abdominal, musculoskeletal, thoracic, pediatric, cardiac, head and neck, and interventional radiology), with 15 questions per subspecialty. Each question prompted the LLMs to provide responses containing four references with in-text citations and complete bibliographic details (authors, title, journal, publication year/month, volume, issue, page numbers, and PubMed Identifier). References were verified using Medline, Google Scholar, the Directory of Open Access Journals, and web searches. Each bibliographic element was scored for correctness, and a composite final score (FS; range 0-36) was calculated by summing the correct elements and multiplying this sum by a 5-point verification score for content relevance. The FS values were then categorized into a 5-point Likert-scale reference accuracy score (RAS: 0 = fabricated; 4 = fully accurate). Non-parametric tests (Kruskal-Wallis, Tamhane's T2, and the Wilcoxon signed-rank test with Bonferroni correction) were used for statistical comparisons.
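Scoring illustration (not part of the published abstract): the minimal Python sketch below shows how the composite FS and the RAS category could be computed. It assumes nine binary-scored bibliographic items and a 0-4 verification multiplier, which is consistent with the reported 0-36 maximum (the abstract lists eight bibliographic fields, so year and month are presumably scored separately), and it uses illustrative, evenly spaced FS cut-offs for the RAS categories, which the abstract does not report.

def final_score(correct_elements: int, verification: int) -> int:
    """FS = (number of correct bibliographic items) * (verification score).
    Assumes 0-9 correct items and a 0-4 relevance score, so FS spans 0-36."""
    if not (0 <= correct_elements <= 9 and 0 <= verification <= 4):
        raise ValueError("inputs outside the assumed ranges")
    return correct_elements * verification

def reference_accuracy_score(fs: int) -> int:
    """Map FS (0-36) to the 5-point RAS (0 = fabricated, 4 = fully accurate).
    The cut-offs are illustrative; the abstract does not state them."""
    for upper, ras in ((0, 0), (9, 1), (18, 2), (27, 3), (36, 4)):
        if fs <= upper:
            return ras
    return 4

# Example: 8 of 9 bibliographic items correct with full content relevance (4)
# gives FS = 32 and, under these illustrative cut-offs, RAS = 4.
fs = final_score(8, 4)
print(fs, reference_accuracy_score(fs))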
Results: Claude 3.5 Sonnet demonstrated the highest reference accuracy, with 80.8% fully accurate references (RAS 4) and a fabrication rate of 3.1%, significantly outperforming all other models (P < 0.001). Claude 3 Opus ranked second, achieving 59.6% fully accurate references and a fabrication rate of 18.3% (P < 0.001). ChatGPT-based models (ChatGPT-4o, ChatGPT-4o with canvas, and ChatGPT o1-preview) exhibited moderate accuracy, with fabrication rates ranging from 27.7% to 52.9% and <8% fully accurate references. Google Gemini 1.5 Pro had the lowest performance, achieving only 2.7% fully accurate references and the highest fabrication rate of 60.6% (P < 0.001). Reference accuracy also varied by subspecialty, with neuroradiology and cardiac radiology outperforming pediatric and head and neck radiology.
Conclusion: Claude 3.5 Sonnet significantly outperformed all other models in generating verifiable radiology references, and Claude 3 Opus showed moderate performance. In contrast, ChatGPT models and Google Gemini 1.5 Pro delivered substantially lower accuracy with higher rates of fabricated references, highlighting current limitations in automated academic citation generation.
Clinical significance: The high accuracy of Claude 3.5 Sonnet can improve radiology literature reviews, research, and education with dependable references. The poor performance of other models, with high fabrication rates, risks misinformation in clinical and academic settings and highlights the need for refinement to ensure safe and effective use.
Journal description:
Diagnostic and Interventional Radiology (Diagn Interv Radiol) is the open-access, online-only official publication of the Turkish Society of Radiology. It is published bimonthly, and the journal's publication language is English.
The journal is a medium for original articles, reviews, pictorial essays, and technical notes related to all fields of diagnostic and interventional radiology.