Assessment of large language models in medical quizzes for clinical chemistry and laboratory management: implications and applications for healthcare artificial intelligence.

Impact Factor: 1.3 | CAS Zone 4 (Medicine) | JCR Q4, Medicine, Research & Experimental
Won Young Heo, Hyung-Doo Park
Scandinavian Journal of Clinical & Laboratory Investigation, pp. 125-132. DOI: 10.1080/00365513.2025.2466054. Epub 19 February 2025; issue date 1 April 2025.
Citations: 0

Abstract

Large language models (LLMs) have demonstrated high performance across various fields due to their ability to understand, generate, and manipulate human language. However, their potential in specialized medical domains, such as clinical chemistry and laboratory management, remains underexplored. This study evaluated the performance of nine LLMs using zero-shot prompting on 109 clinical problem-based quizzes from peer-reviewed journal articles in the Laboratory Medicine Online (LMO) database. These quizzes covered topics in clinical chemistry, toxicology, and laboratory management. The models, including GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, along with their earlier or smaller versions, were assigned roles as clinical chemists or laboratory managers to simulate real-world decision-making scenarios. Among the evaluated models, GPT-4o achieved the highest overall accuracy, correctly answering 81.7% of the quizzes, followed by GPT-4 Turbo (76.1%), Claude 3 Opus (74.3%), and Gemini 1.5 Pro (69.7%), while the lowest performance was observed with Gemini 1.0 Pro (51.4%). GPT-4o performed exceptionally well across all quiz types, including single-select, open-ended, and multiple-select questions, and demonstrated particular strength in quizzes involving figures, tables, or calculations. These findings highlight the ability of LLMs to effectively apply their pre-existing knowledge base to specialized clinical chemistry inquiries without additional fine-tuning. GPT-4o's consistently superior performance across quiz types underscores its potential utility in assisting healthcare professionals in clinical decision-making.
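The evaluation protocol the abstract describes (assigning the model a clinical-chemist role, posing each quiz zero-shot, and scoring against an answer key) can be sketched as follows. This is a minimal illustration, not the authors' code: `query_model` is a hypothetical placeholder for whatever LLM API was called, and the quiz items shown are invented, not taken from the LMO database.

```python
# Sketch of a zero-shot, role-prompted quiz evaluation loop.
# `query_model` is a hypothetical stand-in for a real LLM API call;
# the quizzes are illustrative placeholders, not LMO items.

SYSTEM_PROMPT = (
    "You are a clinical chemist and laboratory manager. "
    "Answer the following quiz with the single best option letter."
)

def query_model(system_prompt: str, question: str) -> str:
    # Placeholder: a real implementation would send both strings to an
    # LLM API here, zero-shot (no examples, no fine-tuning).
    return "A"

def evaluate(quizzes: list[dict]) -> float:
    """Return the fraction of single-select quizzes answered correctly."""
    correct = 0
    for quiz in quizzes:
        answer = query_model(SYSTEM_PROMPT, quiz["question"]).strip().upper()
        if answer == quiz["key"]:
            correct += 1
    return correct / len(quizzes)

quizzes = [
    {"question": "Which analyte rises first after ...? A) ... B) ...", "key": "A"},
    {"question": "Which Westgard rule is violated if ...? A) ... B) ...", "key": "B"},
]
print(f"accuracy = {evaluate(quizzes):.1%}")
```

Because the stub always answers "A", the example reports 50% on these two items; swapping in a real API call turns the same loop into the per-model accuracy comparison reported in the study.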

Source journal metrics: CiteScore 3.50; self-citation rate 4.80%; articles per year 85; review time 4-8 weeks.
Journal description: The Scandinavian Journal of Clinical and Laboratory Investigation is an international scientific journal covering clinically oriented biochemical and physiological research. Since the launch of the journal in 1949, it has been a forum for international laboratory medicine, closely related to, and edited by, The Scandinavian Society for Clinical Chemistry. The journal contains peer-reviewed articles, editorials, invited reviews, and short technical notes, as well as several supplements each year. Supplements consist of monographs, and symposium and congress reports covering subjects within clinical chemistry and clinical physiology.