Assessment of large language models in medical quizzes for clinical chemistry and laboratory management: implications and applications for healthcare artificial intelligence.

IF 1.3 · CAS Tier 4 (Medicine) · JCR Q4 (Medicine, Research & Experimental)
Won Young Heo, Hyung-Doo Park
Journal: Scandinavian Journal of Clinical & Laboratory Investigation, pages 125-132
DOI: 10.1080/00365513.2025.2466054
Published: 2025-04-01 (Epub 2025-02-19)
Citations: 0

Abstract

Large language models (LLMs) have demonstrated high performance across various fields due to their ability to understand, generate, and manipulate human language. However, their potential in specialized medical domains, such as clinical chemistry and laboratory management, remains underexplored. This study evaluated the performance of nine LLMs using zero-shot prompting on 109 clinical problem-based quizzes from peer-reviewed journal articles in the Laboratory Medicine Online (LMO) database. These quizzes covered topics in clinical chemistry, toxicology, and laboratory management. The models, including GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, along with their earlier or smaller versions, were assigned roles as clinical chemists or laboratory managers to simulate real-world decision-making scenarios. Among the evaluated models, GPT-4o achieved the highest overall accuracy, correctly answering 81.7% of the quizzes, followed by GPT-4 Turbo (76.1%), Claude 3 Opus (74.3%), and Gemini 1.5 Pro (69.7%), while the lowest performance was observed with Gemini 1.0 Pro (51.4%). GPT-4o performed exceptionally well across all quiz types, including single-select, open-ended, and multiple-select questions, and demonstrated particular strength in quizzes involving figures, tables, or calculations. These findings highlight the ability of LLMs to effectively apply their pre-existing knowledge base to specialized clinical chemistry inquiries without additional fine-tuning. Among the evaluated models, GPT-4o exhibited superior performance across different quiz types, underscoring its potential utility in assisting healthcare professionals in clinical decision-making.
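The evaluation described above (role-assigned, zero-shot prompting over 109 quizzes, scored as overall accuracy) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the chat-message format, and the example question are assumptions, and the per-model correct counts are back-calculated from the reported percentages assuming all 109 items were scored.

```python
def build_zero_shot_prompt(question: str, role: str = "clinical chemist") -> list[dict]:
    """Zero-shot prompt: a role-setting system message plus the quiz item,
    with no worked examples and no fine-tuning."""
    return [
        {"role": "system",
         "content": f"You are a {role}. Answer the following quiz question."},
        {"role": "user", "content": question},
    ]

def accuracy(correct: int, total: int = 109) -> float:
    """Overall accuracy as a percentage of the quiz set, to one decimal place."""
    return round(100 * correct / total, 1)

# 89 of 109 correct reproduces the 81.7% reported for GPT-4o;
# 56 of 109 reproduces the 51.4% reported for Gemini 1.0 Pro.
print(accuracy(89))  # 81.7
print(accuracy(56))  # 51.4
```

The same prompt builder would be reused across all nine models, varying only the assigned role (clinical chemist or laboratory manager) to match the quiz topic.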

Source journal

CiteScore: 3.50
Self-citation rate: 4.80%
Articles per year: 85
Review time: 4-8 weeks
About the journal: The Scandinavian Journal of Clinical and Laboratory Investigation is an international scientific journal covering clinically oriented biochemical and physiological research. Since its launch in 1949, it has been a forum for international laboratory medicine, closely related to, and edited by, the Scandinavian Society for Clinical Chemistry. The journal contains peer-reviewed articles, editorials, invited reviews, and short technical notes, as well as several supplements each year. Supplements consist of monographs and symposium and congress reports covering subjects within clinical chemistry and clinical physiology.