Evaluation of Six Large Language Models for Clinical Decision Support: Application in Transfusion Decision-making for RhD Blood-type Patients.

IF 4.0 | Zone 2 (Medicine) | JCR Q1 | Medical Laboratory Technology
Jong Kwon Lee, Sooin Choi, Sholhui Park, Sang-Hyun Hwang, Duck Cho
DOI: 10.3343/alm.2024.0588 (https://doi.org/10.3343/alm.2024.0588)
Journal: Annals of Laboratory Medicine
Publication date: 2025-04-28
Publication type: Journal Article
Citations: 0

Abstract

Background
Large language models (LLMs) have the potential for clinical decision support; however, their use in specific tasks, such as determining the RhD blood type for transfusion, remains underexplored. Therefore, we evaluated the accuracy of six LLMs in addressing RhD blood type-related issues in Korean healthcare.

Methods
Fifteen multiple-choice and true/false questions, based on real-world transfusion scenarios and reviewed by specialists, were developed. The questions were administered twice to six LLMs (Clova X, Gemini 1.0, Gemini 1.5, ChatGPT-3.5, GPT-4.0, and GPT-4o) in both Korean and English. Results were compared against the performance of 22 transfusion medicine experts. For particularly challenging questions, prompt engineering was applied, and the questions were reevaluated.

Results
GPT-4o demonstrated the highest accuracy rate in Korean (0.6), with significant differences compared with those of Clova X and Gemini (P < 0.05). In English, the results were similar across all models. The transfusion experts achieved a higher accuracy rate (0.8). Among the five questions subjected to prompt engineering, only GPT-4o correctly responded to one, whereas the other models failed. All LLM models changed their responses or did not respond when the same question was repeated.

Conclusions
GPT-4o showed the best overall performance among the models tested and may be beneficial in RhD blood product transfusion decision-making. However, its performance suggests that it may serve best in a supportive role rather than as a primary decision-making tool.
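
The abstract reports two kinds of measurement: an accuracy rate per model and whether answers stayed the same when a question was repeated. A minimal sketch of how such quantities could be computed is shown below. It is not the authors' analysis code; the function names, the answer encoding, the treatment of a non-response as incorrect, and the four-question toy data are all assumptions made purely for illustration.

```python
from typing import Optional, Sequence


def accuracy(answers: Sequence[Optional[str]], key: Sequence[str]) -> float:
    """Fraction of questions answered correctly.

    Assumption: a missing response (None) is scored as incorrect.
    """
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def consistency(run1: Sequence[Optional[str]], run2: Sequence[Optional[str]]) -> float:
    """Fraction of questions where both runs gave the same, non-missing answer."""
    if len(run1) != len(run2):
        raise ValueError("both runs must cover the same questions")
    return sum(a is not None and a == b for a, b in zip(run1, run2)) / len(run1)


# Hypothetical toy data: four questions, one model, two administrations.
key = ["A", "B", "True", "C"]
run1 = ["A", "B", "False", "C"]
run2 = ["A", "C", "False", None]  # None marks a question the model did not answer

print(accuracy(run1, key))        # 0.75
print(accuracy(run2, key))        # 0.25
print(consistency(run1, run2))    # 0.5
```

Scoring the two administrations separately, as here, keeps the accuracy figure distinct from the repeatability figure, so a model that is often right but unstable between runs is visible as such.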
Source journal
Annals of Laboratory Medicine (Medical Laboratory Technology)
CiteScore: 8.30
Self-citation rate: 12.20%
Articles published: 100
Review time: 6-12 weeks
Journal description: Annals of Laboratory Medicine is the official journal of the Korean Society for Laboratory Medicine. The journal title was changed from the Korean Journal of Laboratory Medicine (ISSN 1598-6535) as of the January 2012 issue. The JCR 2017 impact factor of Ann Lab Med was 1.916.