Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline.

IF 5.8; Zone 2, Medicine; JCR Q1, Health Care Sciences & Services

Journal of Medical Internet Research. Pub Date: 2025-07-11. DOI: 10.2196/71916

HongYi Li, Jun-Fen Fu, Andre Python

Volume 27, e71916. Publication type: Journal Article.

Citation count: 0

Abstract

Background: Large language models (LLMs) can generate outputs understandable by humans, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining the most suitable algorithms to support their work.

Objective: We aimed to provide clinicians and other health care practitioners with systematic guidance in selecting an LLM that is relevant and appropriate to their needs and to facilitate the integration of LLMs into health care.

Methods: We conducted a literature search of full-text publications in English on clinical applications of LLMs published between January 1, 2022, and March 31, 2025, on PubMed, ScienceDirect, Scopus, and IEEE Xplore. We excluded papers from journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research based, or did not involve clinical applications. We also conducted a literature search on arXiv within the same investigated period and included papers on the clinical applications of innovative multimodal LLMs. This led to a total of 270 studies.

Results: We collected 330 LLMs and recorded their application frequency in clinical tasks and frequency of best performance in their context. On the basis of a 5-stage clinical workflow, we found that stages 2, 3, and 4 are key stages in the clinical workflow, involving numerous clinical subtasks and LLMs. However, the diversity of LLMs that may perform optimally in each context remains limited. GPT-3.5 and GPT-4 were the most versatile models in the 5-stage clinical workflow, applied to 52% (29/56) and 71% (40/56) of the clinical subtasks, respectively, and they performed best in 29% (16/56) and 54% (30/56) of the clinical subtasks, respectively. General-purpose LLMs may not perform well in specialized areas, as they often require lightweight prompt engineering methods or fine-tuning techniques based on specific datasets to improve model performance. Most LLMs with multimodal abilities are closed-source models and, therefore, lack transparency, model customization, and fine-tuning options for specific clinical tasks; they may also pose challenges regarding data protection and privacy, which are common requirements in clinical settings.
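As a quick arithmetic check on the Results figures, the following minimal sketch recomputes each reported percentage from its count out of 56 clinical subtasks. The counts and percentages are taken from the abstract; the dictionary layout and the helper name `pct` are illustrative assumptions and are not part of the study.

```python
# Counts and percentages as reported in the Results paragraph.
# Keys and the helper below are hypothetical names used only for this check.
TOTAL_SUBTASKS = 56

reported = {
    ("GPT-3.5", "applied"): (29, 52),
    ("GPT-4", "applied"): (40, 71),
    ("GPT-3.5", "performed best"): (16, 29),
    ("GPT-4", "performed best"): (30, 54),
}


def pct(count: int, total: int = TOTAL_SUBTASKS) -> int:
    """Return count/total as a percentage rounded to the nearest integer."""
    return round(100 * count / total)


# Compare each recomputed percentage with the one stated in the abstract.
all_consistent = all(pct(count) == stated for count, stated in reported.values())

for (model, measure), (count, stated) in reported.items():
    print(f"{model} {measure}: {count}/{TOTAL_SUBTASKS} = {pct(count)}% (reported {stated}%)")
```

Each rounded value matches the percentage stated in the text, so the reported fractions and percentages are internally consistent.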

Conclusions: In this review, we found that LLMs may help clinicians in a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs successfully applicable to a wide range of clinical tasks. Therefore, their clinical deployment remains challenging. On the basis of this review, we propose an interactive online guideline for clinicians to select suitable LLMs by clinical task. With a clinical perspective and free of unnecessary technical jargon, this guideline may be used as a reference to successfully apply LLMs in clinical settings.


Source journal

Journal of Medical Internet Research (Medicine, Health Care)

CiteScore: 14.40

Self-citation rate: 5.40%

Articles published: 654

Review time: 1 month

About the journal: The Journal of Medical Internet Research (JMIR), founded in 1999, is a publication in the field of health informatics and health services. The journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It ranks in the first quartile (Q1) by Impact Factor in these disciplines and is ranked #1 on Google Scholar within the "Medical Informatics" discipline.