Evaluating large language models and agents in healthcare: key challenges in clinical applications

Impact Factor: 6.9 · JCR Q1, Computer Science, Interdisciplinary Applications
Xiaolan Chen, Jiayang Xiang, Shanfu Lu, Yexin Liu, Mingguang He, Danli Shi
Journal: Intelligent medicine, vol. 5, no. 2, pp. 151–163
Published: 2025-05-01 · DOI: 10.1016/j.imed.2025.03.002
URL: https://www.sciencedirect.com/science/article/pii/S2667102625000294
Cited by: 0

Abstract

Large language models (LLMs) have emerged as transformative tools with significant potential across healthcare and medicine. In clinical settings, they hold promise for tasks ranging from clinical decision support to patient education. Advances in LLM agents further broaden their utility by enabling multimodal processing and multitask handling in complex clinical workflows. However, evaluating the performance of LLMs in medical contexts presents unique challenges due to the high-risk nature of healthcare and the complexity of medical data. This paper provides a comprehensive overview of current evaluation practices for LLMs and LLM agents in medicine. Our contributions are threefold. First, we summarized the data sources used in evaluations, including existing medical resources and manually designed clinical questions, offering a basis for LLM evaluation in medical settings. Second, we analyzed key medical task scenarios: closed-ended tasks, open-ended tasks, image processing tasks, and real-world multitask scenarios involving LLM agents, thereby offering guidance for further research across different medical applications. Third, we compared evaluation methods and dimensions, covering both automated metrics and human expert assessments, and addressing traditional accuracy measures alongside agent-specific dimensions such as tool usage and reasoning capabilities. Finally, we identified key challenges and opportunities in this evolving field, emphasizing the need for continued research and interdisciplinary collaboration between healthcare professionals and computer scientists to ensure the safe, ethical, and effective deployment of LLMs in clinical practice.
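For closed-ended tasks, the automated metrics the abstract mentions are typically as simple as exact-match accuracy over multiple-choice answer keys. The sketch below is purely illustrative (the function name and toy data are not from the paper); it shows the kind of scoring commonly applied to USMLE-style benchmarks such as MedQA.

```python
# Hypothetical sketch of an automated metric for closed-ended medical QA:
# exact-match accuracy over multiple-choice answer letters.

def exact_match_accuracy(predictions, gold):
    """Fraction of questions where the model's chosen option matches the key."""
    assert len(predictions) == len(gold), "prediction/key lists must align"
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example: four questions, one wrong answer.
gold = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
print(exact_match_accuracy(predictions, gold))  # 0.75
```

Open-ended and agent-based tasks resist this kind of scoring, which is why the review pairs automated metrics with human expert assessment and agent-specific dimensions such as tool usage.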
Source journal: Intelligent medicine (Surgery, Radiology and Imaging; Artificial Intelligence; Biomedical Engineering)
CiteScore: 5.20 · Self-citation rate: 0.00% · Articles published: 19