Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.

IF 3.8 2区医学 Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Academic Radiology Pub Date : 2024-09-30 DOI:10.1016/j.acra.2024.09.041

Elif Can, Wibke Uller, Katharina Vogt, Michael C Doppler, Felix Busch, Nadine Bayerl, Stephan Ellmann, Avan Kader, Aboelyazid Elkilany, Marcus R Makowski, Keno K Bressem, Lisa C Adams

{"title":"Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.","authors":"Elif Can, Wibke Uller, Katharina Vogt, Michael C Doppler, Felix Busch, Nadine Bayerl, Stephan Ellmann, Avan Kader, Aboelyazid Elkilany, Marcus R Makowski, Keno K Bressem, Lisa C Adams","doi":"10.1016/j.acra.2024.09.041","DOIUrl":null,"url":null,"abstract":"Purpose: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7b and Mistral-8×7b), in simplifying 109 interventional radiology reports.Methods: Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.Results: Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics evaluated (all Bonferroni-corrected p-values: p = 1), while they outperformed other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B showing the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).Conclusions: GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.Clinical relevance/applications: With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.","PeriodicalId":50928,"journal":{"name":"Academic Radiology","volume":null,"pages":null},"PeriodicalIF":3.8000,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Academic Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.acra.2024.09.041","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}

引用次数: 0

Abstract

Purpose: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7b and Mistral-8×7b), in simplifying 109 interventional radiology reports.

Methods: Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.

Results: Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics evaluated (all Bonferroni-corrected p-values: p = 1), while they outperformed other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B showing the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).

Conclusions: GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.

Clinical relevance/applications: With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.

查看原文本刊更多论文

简化介入放射学报告的大语言模型：比较分析。

目的：定量和定性地评估和比较领先的大型语言模型（LLM），包括专利模型（GPT-4、GPT-3.5 Turbo、Claude-3-Opus 和 Gemini Ultra）和开源模型（Mistral-7b 和 Mistral-8×7b）在简化 109 份介入放射学报告方面的性能：采用李克特五点量表对准确性、完整性、清晰度、临床相关性、自然度和错误率（包括破坏信任和治疗后不当行为错误）进行定性评估。定量可读性采用 Flesch Reading Ease (FRE)、Flesch-Kincaid Grade Level (FKGL)、SMOG Index 和 Dale-Chall Readability Score (DCRS) 进行评估。统计分析采用配对 t 检验和 Bonferroni 校正 p 值：定性评估结果表明，GPT-4 和 Claude-3-Opus 在所有评估指标上都没有明显差异（所有 Bonferroni 校正 p 值：p = 1），而在五个定性指标上，它们的表现优于其他评估模型（p 结论：GPT-4 和 Claude-3-Opus 在所有评估指标上都没有明显差异（所有 Bonferroni 校正 p 值：p = 1），而在五个定性指标上，它们的表现优于其他评估模型：GPT-4和Claude-3-Opus在生成简化的IR报告方面表现优异，但所有模型都存在误差，包括破坏信任的误差，这凸显了在临床应用前进一步完善和验证的必要性：临床相关性/应用：随着介入放射学（IR）手术的日益复杂和电子健康记录的日益普及，简化IR报告对于改善患者理解和临床决策至关重要。本研究深入探讨了各种 LLM 在重写 IR 报告方面的性能，有助于为以患者为中心的临床应用选择最合适的模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Academic Radiology 医学-核医学

CiteScore

7.60

自引率

10.40%

发文量

432

审稿时长

18 days

期刊介绍： Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.