Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review.

IF 3.3 · JCR Q2 (Medical Informatics) · CAS Tier 3 (Medicine)
Cindy N Ho, Tiffany Tian, Alessandra T Ayers, Rachel E Aaron, Vidith Phillips, Risa M Wolf, Nestoras Mathioudakis, Tinglong Dai, David C Klonoff
Journal: BMC Medical Informatics and Decision Making, 24(1):357. Published: 2024-11-26. DOI: 10.1186/s12911-024-02757-z
Citations: 0

Abstract

Background: Large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have drawn growing attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.

Methods: We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans.

Results: We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency".

Conclusions: The most frequently used criteria for defining high-quality LLM outputs have been consistently selected by researchers over the past 1.5 years. However, we identified a high degree of variation in how studies reported their findings and assessed LLM performance. Developing standardized reporting of qualitative evaluation metrics for the quality of LLM outputs would facilitate research on LLMs in healthcare.
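The five most frequently used criteria identified in the Results could, for instance, be captured in a standardized scoring record of the kind the Conclusions call for. The sketch below is purely illustrative: the class, field names, and 1-5 Likert scale are assumptions for demonstration, not a scheme proposed by the review.

```python
from dataclasses import dataclass

# The five most frequently used scoring criteria reported in the review.
CRITERIA = ("accuracy", "completeness", "appropriateness", "insight", "consistency")


@dataclass
class LLMOutputScore:
    """One reviewer's qualitative rating of a single LLM-generated output.

    Hypothetical record format; a 1-5 Likert scale is assumed here.
    """
    model: str
    scores: dict  # criterion name -> integer rating, 1 (worst) to 5 (best)

    def __post_init__(self):
        unknown = set(self.scores) - set(CRITERIA)
        if unknown:
            raise ValueError(f"unknown criteria: {sorted(unknown)}")
        for criterion, rating in self.scores.items():
            if not 1 <= rating <= 5:
                raise ValueError(f"{criterion} rating {rating} outside 1-5 scale")

    def mean(self) -> float:
        """Mean rating across the criteria actually scored."""
        return sum(self.scores.values()) / len(self.scores)


# Example: a reviewer rates one model response on three of the five criteria.
s = LLMOutputScore("GPT-4", {"accuracy": 5, "completeness": 4, "consistency": 4})
print(round(s.mean(), 2))  # → 4.33
```

A shared record like this would let studies report per-criterion ratings in a comparable form, which is the kind of standardization the review concludes is currently missing.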

Source journal metrics: CiteScore 7.20; self-citation rate 5.70%; articles per year 297; time to first review 1 month.

Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.