Analyzing evaluation methods for large language models in the medical field: a scoping review.

IF 3.3 · Zone 3 (Medicine) · Q2 MEDICAL INFORMATICS

BMC Medical Informatics and Decision Making Pub Date : 2024-11-29 DOI:10.1186/s12911-024-02709-7

Junbok Lee, Sungkyung Park, Jaeyong Shin, Belong Cho


Citations: 0

Abstract

Background: Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs.

Objective: This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies.

Methods & materials: We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy.

Results: A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either administering test examinations to the model (n = 53, 37.3%) or having a medical professional evaluate its output (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). For test examinations, the most common number of questions was 100 or fewer (n = 18, 29.0%); 15 studies (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), the most common number of evaluators was two (n = 43, 48.3%), and 14 studies (14.7%) used prompt engineering.
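As a quick arithmetic check, the four category counts reported in the Results can be converted back into percentages of the 142 included articles. The sketch below is illustrative only: the dictionary labels and the `share` helper are names chosen here, not taken from the paper, and it covers only the four top-level categories, because the abstract does not state the denominators for the later subgroup percentages.

```python
# Counts of included articles by evaluation type, as reported in the Results.
# The labels are shorthand chosen for this sketch.
counts = {
    "test examination": 53,
    "evaluation by medical professional": 80,
    "hybrid": 5,
    "combination": 4,
}

# The four categories should account for all included articles.
total = sum(counts.values())


def share(n, total):
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


shares = {label: share(n, total) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: n = {counts[label]}, {pct}%")
```

The counts sum to 142, and the rounded shares match the percentages given in the abstract.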

Conclusions: More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.


Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20

Self-citation rate: 5.70%

Articles published: 297

Review time: 1 month

Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.