Jong Kwon Lee, Sholhui Park, Sang-Hyun Hwang, Jaejoon Lee, Duck Cho, Sooin Choi
Vox Sanguinis, published 2025-05-23. DOI: 10.1111/vox.70050 (https://doi.org/10.1111/vox.70050)
Comparative evaluation of six large language models in transfusion medicine: Addressing language and domain-specific challenges.
Background and objectives: Large language models (LLMs) such as GPT-4 are increasingly utilized in clinical and educational settings; however, their validity in subspecialized domains like transfusion medicine remains insufficiently characterized. This study assessed the performance of six LLMs on transfusion-related questions from Korean national licensing examinations for medical doctors (MDs) and medical technologists (MTs).
Materials and methods: A total of 23 MD and 67 MT questions (2020-2023) were extracted from publicly available sources. All items were originally written in Korean and subsequently translated into English to evaluate cross-linguistic performance. Each model received standardized multiple-choice prompts (five options), and correctness was determined by explicit answer selection. Accuracy was calculated as the proportion of correct responses, with 0.75 designated as the performance threshold. Chi-square tests were employed to analyse language-based differences.
Results: GPT-4 and GPT-4o consistently surpassed the 0.75 threshold across both languages and examination types. GPT-3.5 demonstrated reasonable accuracy in English but showed a marked decline in Korean, suggesting limitations in multilingual generalization. Gemini 1.5 outperformed Gemini 1, particularly in Korean, though both exhibited variability across technical subdomains. Clova X showed inconsistent results across settings. All models demonstrated limited performance in legal and ethical scenarios.
Conclusion: GPT-4 and GPT-4o exhibited robust and reliable performance across a range of transfusion medicine topics. Nonetheless, inter-model and inter-language variability highlights the need for targeted fine-tuning, particularly in the context of local regulatory and ethical frameworks, to support safe and context-appropriate implementation in clinical practice.
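The accuracy calculation and the chi-square comparison between languages described under Materials and methods can be illustrated with a minimal sketch. The abstract reports no per-model counts, so the correct-answer counts below are hypothetical placeholders, and pooling the 23 MD and 67 MT questions into one set of 90 is an assumption made only for illustration. The abstract also does not say whether a continuity correction was applied; this sketch uses the uncorrected Pearson statistic on a 2x2 table. The function names are illustrative, not taken from the study.

```python
import math

THRESHOLD = 0.75  # performance threshold stated in the abstract


def accuracy(correct: int, total: int) -> float:
    """Proportion of correct responses."""
    return correct / total


def chi_square_2x2(correct_a: int, total_a: int, correct_b: int, total_b: int):
    """Pearson chi-square test (no continuity correction) on a 2x2 table of
    correct/incorrect counts for two language conditions.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # If a whole row or column is empty the test is undefined;
    # report "no detectable difference" rather than dividing by zero.
    if 0 in row_totals or 0 in col_totals:
        return 0.0, 1.0

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    total_questions = 23 + 67  # MD + MT items, pooled here for illustration
    # HYPOTHETICAL counts for one model; not results from the study.
    english_correct = 72
    korean_correct = 54

    for label, correct in (("English", english_correct), ("Korean", korean_correct)):
        acc = accuracy(correct, total_questions)
        status = "meets" if acc >= THRESHOLD else "below"
        print(f"{label}: accuracy={acc:.2f} ({status} threshold {THRESHOLD})")

    stat, p = chi_square_2x2(
        english_correct, total_questions, korean_correct, total_questions
    )
    print(f"chi-square={stat:.3f}, p={p:.4f}")
```

With small question sets such as the 23 MD items, expected cell counts can fall below 5, in which case Fisher's exact test is the usual alternative to the chi-square approximation.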
About the journal:
Vox Sanguinis reports on important, novel developments in transfusion medicine. Original papers, reviews and international fora are published on all aspects of blood transfusion and tissue transplantation, comprising five main sections:
1) Transfusion-Transmitted Disease and its Prevention:
Identification and epidemiology of infectious agents transmissible by blood;
Bacterial contamination of blood components;
Donor recruitment and selection methods;
Pathogen inactivation.
2) Blood Component Collection and Production:
Blood collection methods and devices (including apheresis);
Plasma fractionation techniques and plasma derivatives;
Preparation of labile blood components;
Inventory management;
Hematopoietic progenitor cell collection and storage;
Collection and storage of tissues;
Quality management and good manufacturing practice;
Automation and information technology.
3) Transfusion Medicine and New Therapies:
Transfusion thresholds and audits;
Haemovigilance;
Clinical trials regarding appropriate haemotherapy;
Non-infectious adverse effects of transfusion;
Therapeutic apheresis;
Support of transplant patients;
Gene therapy and immunotherapy.
4) Immunohaematology and Immunogenetics:
Autoimmunity in haematology;
Alloimmunity of blood;
Pre-transfusion testing;
Immunodiagnostics;
Immunobiology;
Complement in immunohaematology;
Blood typing reagents;
Genetic markers of blood cells and serum proteins: polymorphisms and function;
Genetic markers and disease;
Parentage testing and forensic immunohaematology.
5) Cellular Therapy:
Cell-based therapies;
Stem cell sources;
Stem cell processing and storage;
Stem cell products;
Stem cell plasticity;
Regenerative medicine with cells;
Cellular immunotherapy;
Molecular therapy;
Gene therapy.