Performance of open-source and proprietary large language models in generating patient-friendly radiology chest CT reports

Philipp Prucker, Felix Busch, Felix Dorfner, Christian J. Mertens, Nadine Bayerl, Marcus R. Makowski, Keno K. Bressem, Lisa C. Adams

Clinical Imaging, Volume 125, Article 110557 (July 2025). DOI: 10.1016/j.clinimag.2025.110557
Rationale and objectives
Large Language Models (LLMs) show promise for generating patient-friendly radiology reports, but the performance of open-source versus proprietary LLMs needs assessment. This study compares open-source and proprietary LLMs in generating patient-friendly radiology reports from chest CTs, using quantitative readability metrics and qualitative assessments by radiologists.
Materials and methods
Fifty chest CT reports were processed by seven LLMs: three open-source models (Llama-3-70b, Mistral-7b, Mixtral-8x7b) and four proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, Gemini-Ultra). Simplification was evaluated using five quantitative readability metrics. Three radiologists rated patient-friendliness on a five-point Likert scale across five criteria. Content and coherence errors were counted. Inter-rater reliability and differences among models were statistically assessed.
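The abstract does not name the five readability metrics, though the Results reference the Coleman-Liau Index (CLI). As a minimal sketch, and assuming the CLI was among them, the snippet below computes it from the published formula CLI = 0.0588·L − 0.296·S − 15.8, where L is letters per 100 words and S is sentences per 100 words; the sample report text is hypothetical, not taken from the study.

```python
import re

def coleman_liau_index(text: str) -> float:
    """Coleman-Liau Index: CLI = 0.0588*L - 0.296*S - 15.8,
    where L = letters per 100 words and S = sentences per 100 words."""
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    letters = sum(len(w) for w in words)                   # alphabetic characters only
    sentences = max(1, len(re.findall(r"[.!?]+", text)))   # crude sentence splitter
    L = letters / n_words * 100
    S = sentences / n_words * 100
    return 0.0588 * L - 0.296 * S - 15.8

# Hypothetical simplified report snippet (not from the study's data).
simplified = (
    "Your lungs look healthy overall. We saw one small spot, about the "
    "size of a pea, in your right lung. It is most likely harmless, but "
    "we suggest a follow-up scan in twelve months to be safe."
)
print(f"CLI: {coleman_liau_index(simplified):.2f}")  # lower = easier to read
```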
Results
Inter-rater reliability was substantial to near-perfect (κ = 0.76–0.86). Qualitatively, Llama-3-70b was non-inferior to the leading proprietary models in four of five categories. GPT-3.5-Turbo showed the best overall readability, outperforming GPT-4 on two metrics. Llama-3-70b outperformed GPT-3.5-Turbo on the Coleman-Liau Index (CLI; p = 0.006). Claude-3-Opus and Gemini-Ultra scored lower on readability but were rated highly in the qualitative assessments. Claude-3-Opus maintained perfect factual accuracy. Claude-3-Opus and GPT-4 outperformed Llama-3-70b in emotional sensitivity (90.0% vs 46.0%, p < 0.001).
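The abstract does not state which kappa variant or significance tests were used. Below is a minimal sketch assuming Fleiss' kappa for the three-rater agreement (via statsmodels) and a paired Wilcoxon signed-rank test for the per-report readability comparison (via SciPy); all rating and score arrays are invented for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(42)

# Hypothetical 5-point Likert ratings: 50 reports x 3 radiologists.
base = rng.integers(3, 6, size=(50, 1))                       # per-report tendency
ratings = np.clip(base + rng.integers(-1, 2, size=(50, 3)), 1, 5)

# Fleiss' kappa expects a subjects-by-categories count table.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.2f}")

# Paired comparison of per-report CLI scores between two models (scores are
# invented; the study reported p = 0.006 for Llama-3-70b vs GPT-3.5-Turbo).
cli_llama3 = rng.normal(8.5, 1.0, size=50)
cli_gpt35 = cli_llama3 + rng.normal(0.6, 1.0, size=50)  # higher CLI = harder to read
stat, p = wilcoxon(cli_llama3, cli_gpt35)
print(f"Wilcoxon signed-rank: W = {stat:.1f}, p = {p:.4f}")
```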
Conclusions
Llama-3-70b shows strong potential for generating high-quality, patient-friendly radiology reports, challenging proprietary models. With further adaptation, open-source LLMs could advance patient-friendly reporting technology.
About the journal:
The mission of Clinical Imaging is to publish, in a timely manner, the very best radiology research from the United States and around the world with special attention to the impact of medical imaging on patient care. The journal's publications cover all imaging modalities, radiology issues related to patients, policy and practice improvements, and clinically-oriented imaging physics and informatics. The journal is a valuable resource for practicing radiologists, radiologists-in-training and other clinicians with an interest in imaging. Papers are carefully peer-reviewed and selected by our experienced subject editors who are leading experts spanning the range of imaging sub-specialties, which include:
- Body Imaging
- Breast Imaging
- Cardiothoracic Imaging
- Imaging Physics and Informatics
- Molecular Imaging and Nuclear Medicine
- Musculoskeletal and Emergency Imaging
- Neuroradiology
- Practice, Policy & Education
- Pediatric Imaging
- Vascular and Interventional Radiology