Comparative evaluation of six large language models in transfusion medicine: Addressing language and domain-specific challenges.

IF 1.6 | Zone 4, Medicine | JCR Q3, HEMATOLOGY
Vox Sanguinis Pub Date : 2025-05-23 DOI:10.1111/vox.70050
Jong Kwon Lee, Sholhui Park, Sang-Hyun Hwang, Jaejoon Lee, Duck Cho, Sooin Choi
Citations: 0

Abstract

Background and objectives: Large language models (LLMs) such as GPT-4 are increasingly utilized in clinical and educational settings; however, their validity in subspecialized domains like transfusion medicine remains insufficiently characterized. This study assessed the performance of six LLMs on transfusion-related questions from Korean national licensing examinations for medical doctors (MDs) and medical technologists (MTs).

Materials and methods: A total of 23 MD and 67 MT questions (2020-2023) were extracted from publicly available sources. All items were originally written in Korean and subsequently translated into English to evaluate cross-linguistic performance. Each model received standardized multiple-choice prompts (five options), and correctness was determined by explicit answer selection. Accuracy was calculated as the proportion of correct responses, with 0.75 designated as the performance threshold. Chi-square tests were employed to analyse language-based differences.
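The accuracy calculation and the language comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the correct-answer counts in the usage example are invented placeholders (only the total of 67 MT questions comes from the abstract), and the abstract does not say whether a continuity correction was applied, so the sketch assumes an uncorrected Pearson chi-square test on a 2x2 table.

```python
import math

THRESHOLD = 0.75  # performance threshold stated in the abstract


def accuracy(correct: int, total: int) -> float:
    """Proportion of correctly answered questions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def chi_square_2x2(correct_a: int, total_a: int,
                   correct_b: int, total_b: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction, an assumption)
    on correct/incorrect counts for two languages.

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    n = total_a + total_b
    row_sums = [total_a, total_b]
    col_sums = [correct_a + correct_b, n - correct_a - correct_b]
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("test is undefined when a row or column sums to zero")

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical counts for one model on the 67 MT questions;
    # these numbers are NOT taken from the study.
    english_correct, korean_correct, total = 55, 40, 67
    for label, correct in (("English", english_correct), ("Korean", korean_correct)):
        acc = accuracy(correct, total)
        print(f"{label}: accuracy={acc:.3f}, meets threshold={acc >= THRESHOLD}")
    stat, p = chi_square_2x2(english_correct, total, korean_correct, total)
    print(f"chi-square={stat:.3f}, p={p:.4f}")
```

With only 23 MD questions, some expected cell counts may be small, in which case an exact test could be more appropriate than the chi-square approximation; the abstract does not state how the authors handled this.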

Results: GPT-4 and GPT-4o consistently surpassed the 0.75 threshold across both languages and examination types. GPT-3.5 demonstrated reasonable accuracy in English but showed a marked decline in Korean, suggesting limitations in multilingual generalization. Gemini 1.5 outperformed Gemini 1, particularly in Korean, though both exhibited variability across technical subdomains. Clova X showed inconsistent results across settings. All models demonstrated limited performance in legal and ethical scenarios.

Conclusion: GPT-4 and GPT-4o exhibited robust and reliable performance across a range of transfusion medicine topics. Nonetheless, inter-model and inter-language variability highlights the need for targeted fine-tuning, particularly in the context of local regulatory and ethical frameworks, to support safe and context-appropriate implementation in clinical practice.

Source journal
Vox Sanguinis (Medicine: Hematology)
CiteScore: 4.40
Self-citation rate: 11.10%
Articles published: 156
Review time: 6-12 weeks
Journal description: Vox Sanguinis reports on important, novel developments in transfusion medicine. Original papers, reviews and international fora are published on all aspects of blood transfusion and tissue transplantation, comprising five main sections: 1) Transfusion-Transmitted Disease and its Prevention: Identification and epidemiology of infectious agents transmissible by blood; Bacterial contamination of blood components; Donor recruitment and selection methods; Pathogen inactivation. 2) Blood Component Collection and Production: Blood collection methods and devices (including apheresis); Plasma fractionation techniques and plasma derivatives; Preparation of labile blood components; Inventory management; Hematopoietic progenitor cell collection and storage; Collection and storage of tissues; Quality management and good manufacturing practice; Automation and information technology. 3) Transfusion Medicine and New Therapies: Transfusion thresholds and audits; Haemovigilance; Clinical trials regarding appropriate haemotherapy; Non-infectious adverse effects of transfusion; Therapeutic apheresis; Support of transplant patients; Gene therapy and immunotherapy. 4) Immunohaematology and Immunogenetics: Autoimmunity in haematology; Alloimmunity of blood; Pre-transfusion testing; Immunodiagnostics; Immunobiology; Complement in immunohaematology; Blood typing reagents; Genetic markers of blood cells and serum proteins: polymorphisms and function; Genetic markers and disease; Parentage testing and forensic immunohaematology. 5) Cellular Therapy: Cell-based therapies; Stem cell sources; Stem cell processing and storage; Stem cell products; Stem cell plasticity; Regenerative medicine with cells; Cellular immunotherapy; Molecular therapy; Gene therapy.