Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review.

IF 3.3 · JCR Q2 (Medical Informatics) · CAS Tier 3 (Medicine)
Cindy N Ho, Tiffany Tian, Alessandra T Ayers, Rachel E Aaron, Vidith Phillips, Risa M Wolf, Nestoras Mathioudakis, Tinglong Dai, David C Klonoff
Journal: BMC Medical Informatics and Decision Making, 24(1):357. Published: 2024-11-26. DOI: 10.1186/s12911-024-02757-z
Citations: 0

Abstract

Background: Large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have drawn growing attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.

Methods: We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans.

Results: We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency".

Conclusions: The most frequently used criteria for defining high-quality LLM outputs have been consistently selected by researchers over the past 1.5 years. However, we identified a high degree of variation in how studies reported their findings and assessed LLM performance. Developing standardized reporting of qualitative evaluation metrics for the quality of LLM outputs would facilitate research on LLMs in healthcare.
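The five most frequently used criteria identified in the Results could, for instance, be captured in a standardized scoring record of the kind the Conclusions call for. The sketch below is purely illustrative: the class, field names, and 1-5 Likert scale are assumptions for demonstration, not a scheme proposed by the review.

```python
from dataclasses import dataclass

# The five most frequently used scoring criteria reported in the review.
CRITERIA = ("accuracy", "completeness", "appropriateness", "insight", "consistency")


@dataclass
class LLMOutputScore:
    """One reviewer's qualitative rating of a single LLM-generated output.

    Hypothetical record format; a 1-5 Likert scale is assumed here.
    """
    model: str
    scores: dict  # criterion name -> integer rating, 1 (worst) to 5 (best)

    def __post_init__(self):
        unknown = set(self.scores) - set(CRITERIA)
        if unknown:
            raise ValueError(f"unknown criteria: {sorted(unknown)}")
        for criterion, rating in self.scores.items():
            if not 1 <= rating <= 5:
                raise ValueError(f"{criterion} rating {rating} outside 1-5 scale")

    def mean(self) -> float:
        """Mean rating across the criteria actually scored."""
        return sum(self.scores.values()) / len(self.scores)


# Example: a reviewer rates one model response on three of the five criteria.
s = LLMOutputScore("GPT-4", {"accuracy": 5, "completeness": 4, "consistency": 4})
print(round(s.mean(), 2))  # → 4.33
```

A shared record like this would let studies report per-criterion ratings in a comparable form, which is the kind of standardization the review concludes is currently missing.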

Source journal metrics: CiteScore 7.20; self-citation rate 5.70%; articles per year 297; time to first review 1 month.

Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.