Comparative Evaluation of a Medical Large Language Model in Answering Real-World Radiation Oncology Questions: Multicenter Observational Study.

Impact Factor: 6.0, CAS Zone 2 (Medicine), JCR Q1 (Health Care Sciences & Services)

Journal of Medical Internet Research. Pub Date: 2025-09-23. DOI: 10.2196/69752

Fabio Dennstädt, Max Schmerder, Elena Riggenbach, Lucas Mose, Katarina Bryjova, Nicolas Bachmann, Paul-Henry Mackeprang, Maiwand Ahmadsei, Dubravko Sinovcic, Paul Windisch, Daniel Zwahlen, Susanne Rogers, Oliver Riesterer, Martin Maffei, Eleni Gkika, Hathal Haddad, Jan Peeken, Paul Martin Putora, Markus Glatzer, Florian Putz, Daniel Hoefler, Sebastian M Christ, Irina Filchenko, Janna Hastings, Roberto Gaio, Lawrence Chiang, Daniel M Aebersold, Nikola Cihoric

Journal of Medical Internet Research, volume 27, e69752. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12504895/pdf/

Abstract

Background: Large language models (LLMs) hold promise for supporting clinical tasks, particularly in data-driven and technical disciplines such as radiation oncology. While prior studies have evaluated LLMs mainly in examination-style settings, their performance in real-life clinical scenarios remains unclear. In the future, LLMs might be used as general AI assistants to answer questions arising in clinical practice. It is unclear how well a modern LLM, run locally within a hospital's infrastructure, would answer such questions compared with clinical experts.

Objective: This study aimed to assess the performance of a locally deployed, state-of-the-art medical LLM in answering real-world clinical questions in radiation oncology compared with clinical experts. The aim was to evaluate the overall quality of answers, as well as the potential harmfulness of the answers if used for clinical decision-making.

Methods: Physicians from 10 departments of European hospitals collected questions arising in the clinical practice of radiation oncology. Fifty of these questions were answered by 3 senior radiation oncology experts with at least 10 years of work experience, as well as by the LLM Llama3-OpenBioLLM-70B (Ankit Pal and Malaikannan Sankarasubbu). In a blinded review, physicians rated the overall answer quality on a 5-point Likert scale (quality), assessed whether an answer might be potentially harmful if used for clinical decision-making (harmfulness), and judged whether each response came from an expert or the LLM (recognizability). The clinical experts and the LLM were then compared on quality, harmfulness, and recognizability.
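The quality ratings described above are summarized in the Results as a mean, a median, and an IQR. The following is a minimal Python sketch of that kind of summary, not the authors' analysis code. The function name `summarize_likert`, the example ratings, and the choice of the "inclusive" quartile method are all assumptions made for illustration; the abstract does not state how ratings were aggregated or which quartile convention was used.

```python
from statistics import mean, median, quantiles


def summarize_likert(ratings):
    """Summarize 5-point Likert ratings as mean, median, and quartiles.

    `ratings` is a list of integers from 1 to 5. The quartile convention
    ("inclusive") is an assumption; other conventions give slightly
    different IQR bounds on small samples.
    """
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be non-empty values between 1 and 5")
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return {
        "mean": mean(ratings),
        "median": median(ratings),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical ratings for illustration only; not data from the study.
example_ratings = [3, 4, 4, 5, 2, 4, 3, 4]
summary = summarize_likert(example_ratings)
print(summary)
```

Because Likert data are ordinal, the median and IQR are usually the more robust summary, which is consistent with the abstract reporting them alongside the mean.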

Results: There was no significant difference in answer quality between the LLM and the clinical experts (mean scores of 3.38 vs 3.63; median 4.00, IQR 3.00-4.00 vs median 3.67, IQR 3.33-4.00; P=.26; Wilcoxon signed rank test). The answers were deemed potentially harmful in 13% of cases for the clinical experts compared with 16% of cases for the LLM (P=.63; Fisher exact test). Physicians correctly identified whether an answer was given by a clinical expert or an LLM in 78% and 72% of cases, respectively.
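The harmfulness comparison above relies on a Fisher exact test on a 2x2 table (harmful vs not harmful, experts vs LLM). The following is a minimal pure-Python sketch of a two-sided Fisher exact test, shown only to illustrate the technique. The function name `fisher_exact_two_sided` is hypothetical, and the example counts are assumptions: the abstract reports percentages (13% and 16%) but not the underlying denominators, so the sketch is not expected to reproduce the reported P value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (for example experts, LLM); columns are outcomes
    (for example harmful, not harmful). Returns the two-sided P value,
    summing the probabilities of all tables with the same margins that
    are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Hypergeometric probability of k first-column counts in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(k)
        for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts for illustration only (13 of 100 vs 16 of 100);
# the study's actual denominators are not given in the abstract.
p = fisher_exact_two_sided(13, 87, 16, 84)
print(f"two-sided P = {p:.3f}")
```

In practice a maintained statistics library would normally be used instead of a hand-written routine; the sketch only makes the calculation behind the test explicit.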

Conclusions: A state-of-the-art medical LLM can answer real-life questions from the clinical practice of radiation oncology about as well as clinical experts in terms of overall quality and potential harmfulness. Such LLMs can already be deployed within the local hospital environment at an affordable cost. While LLMs may not yet be ready for clinical implementation as general AI assistants, the technology continues to improve at a rapid pace. Evaluation studies based on real-life situations are important to better understand the weaknesses and limitations of LLMs in clinical practice. Such studies are also crucial for defining when the technology is ready for clinical implementation. Furthermore, education for health care professionals on generative AI is needed to ensure responsible clinical implementation of this transformative technology.

Source journal

Journal of Medical Internet Research (Medicine: Health Care)

CiteScore: 14.40

Self-citation rate: 5.40%

Articles published: 654

Review time: 1 month

About the journal: The Journal of Medical Internet Research (JMIR), founded in 1999, is a journal in the field of health informatics and health services. It focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It ranks in the first quartile (Q1) by Impact Factor and is ranked #1 on Google Scholar within the "Medical Informatics" discipline.