Summarizing clinical evidence utilizing large language models for cancer treatments: a blinded comparative analysis.

IF 3.2 Q1 HEALTH CARE SCIENCES & SERVICES

Frontiers in digital health Pub Date : 2025-04-29 eCollection Date: 2025-01-01 DOI:10.3389/fdgth.2025.1569554

Samuel Rubinstein, Aleenah Mohsin, Rahul Banerjee, Will Ma, Sanjay Mishra, Mary Kwok, Peter Yang, Jeremy L Warner, Andrew J Cowan

Frontiers in digital health, volume 7, article 1569554. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12069342/pdf/

Citation count: 0

Abstract

Background: Concise synopses of clinical evidence support treatment decision-making but are time-consuming to curate. Large language models (LLMs) offer potential but they may provide inaccurate information. We objectively assessed the abilities of four commercially available LLMs to generate synopses for six treatment regimens in multiple myeloma and amyloid light chain (AL) amyloidosis.

Methods: We compared the performance of four LLMs: Claude 3.5, ChatGPT 4.0, Gemini 1.0, and Llama-3.1. Each LLM was prompted to write synopses for six regimens. Two hematologists independently assessed accuracy, completeness, relevance, clarity, coherence, and hallucinations using Likert scales. Mean scores with 95% confidence intervals (CI) were calculated across all domains, and inter-rater reliability was evaluated using Cohen's quadratic weighted kappa.
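As an illustration of the two statistics named in the Methods, the following is a minimal Python sketch, not the authors' analysis code. It assumes a 1-5 Likert scale and a normal-approximation confidence interval, neither of which the abstract states. The function names `mean_ci` and `quadratic_weighted_kappa` and the example ratings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(scores, level=0.95):
    """Mean with a normal-approximation confidence interval.

    Assumption: the abstract does not say which interval method was
    used; a t-based interval would be wider for small samples.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    m = mean(scores)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = z * stdev(scores) / sqrt(n)
    return m, m - half_width, m + half_width


def quadratic_weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5)):
    """Cohen's kappa with quadratic disagreement weights.

    Assumption: ratings are on a 1-5 Likert scale (hypothetical default).
    Returns NaN when expected disagreement is zero, e.g. when both
    raters give the same single score to every item.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed contingency table of rating pairs.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Made-up ratings for six synopses; not data from the study.
    hematologist_1 = [4, 4, 3, 5, 4, 3]
    hematologist_2 = [4, 3, 3, 4, 4, 4]
    print(mean_ci(hematologist_1 + hematologist_2))
    print(quadratic_weighted_kappa(hematologist_1, hematologist_2))
```

Quadratic weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which suits ordinal Likert ratings where near-misses matter less than large gaps.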

Results: Claude demonstrated the highest performance in all domains, outperforming the other LLMs in accuracy (mean Likert score 3.92 [95% CI 3.54-4.29] vs. ChatGPT 3.25 [2.76-3.74], Gemini 3.17 [2.54-3.80], and Llama 1.92 [1.41-2.43]); completeness (mean Likert score 4.00 [3.66-4.34] vs. ChatGPT 2.58 [2.02-3.15], Gemini 2.58 [2.02-3.15], and Llama 1.67 [1.39-1.95]); and extent of hallucinations (mean Likert score 4.00 [4.00-4.00] vs. ChatGPT 2.75 [2.06-3.44], Gemini 3.25 [2.65-3.85], and Llama 1.92 [1.26-2.57]). Llama performed considerably worse across all studied domains. ChatGPT and Gemini had intermediate performance. Notably, none of the LLMs registered perfect accuracy, completeness, or relevance.

Conclusion: Although Claude performed at a consistently higher level than the other LLMs, all tested LLMs required careful editing from a domain expert to become usable. More time will be needed to determine the suitability of LLMs to independently generate clinical synopses.


