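Evaluation of Six Large Language Models for Clinical Decision Support: Application in Transfusion Decision-making for RhD Blood-type Patients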

IF 3.9 | Zone 2 (Medicine) | Q1 MEDICAL LABORATORY TECHNOLOGY

Annals of Laboratory Medicine Pub Date : 2025-04-28 DOI:10.3343/alm.2024.0588

Jong Kwon Lee, Sooin Choi, Sholhui Park, Sang-Hyun Hwang, Duck Cho


Citations: 0


Evaluation of Six Large Language Models for Clinical Decision Support: Application in Transfusion Decision-making for RhD Blood-type Patients.

Background
Large language models (LLMs) have the potential for clinical decision support; however, their use in specific tasks, such as determining the RhD blood type for transfusion, remains underexplored. Therefore, we evaluated the accuracy of six LLMs in addressing RhD blood type-related issues in Korean healthcare.

Methods
Fifteen multiple-choice and true/false questions, based on real-world transfusion scenarios and reviewed by specialists, were developed. The questions were administered twice to six LLMs (Clova X, Gemini 1.0, Gemini 1.5, ChatGPT-3.5, GPT-4.0, and GPT-4o) in both Korean and English. Results were compared against the performance of 22 transfusion medicine experts. For particularly challenging questions, prompt engineering was applied, and the questions were reevaluated.

Results
GPT-4o demonstrated the highest accuracy rate in Korean (0.6), with significant differences compared with those of Clova X and Gemini (P < 0.05). In English, the results were similar across all models. The transfusion experts achieved a higher accuracy rate (0.8). Among the five questions subjected to prompt engineering, only GPT-4o correctly responded to one, whereas the other models failed. All LLMs changed their responses or did not respond when the same question was repeated.

Conclusions
GPT-4o showed the best overall performance among the models tested and may be beneficial in RhD blood product transfusion decision-making. However, its performance suggests that it may serve best in a supportive role rather than as a primary decision-making tool.
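The abstract reports an accuracy rate per model and notes that responses changed between repeated administrations, but it does not say how answers were scored or how the two runs were combined. The following is a minimal sketch of one way such figures could be computed, assuming each response is compared to a single answer key and a missing response counts as incorrect. The function names and the four-question toy data are hypothetical and are not taken from the study.

```python
from typing import List, Optional, Sequence


def accuracy(responses: Sequence[Optional[str]], answer_key: Sequence[str]) -> float:
    """Fraction of questions answered correctly.

    Assumption: a missing response (None) is scored as incorrect.
    """
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    correct = sum(r == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)


def changed_on_repeat(
    first: Sequence[Optional[str]], second: Sequence[Optional[str]]
) -> List[int]:
    """0-based indices of questions whose response differed between two runs."""
    if len(first) != len(second):
        raise ValueError("both runs must cover the same questions")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


# Hypothetical toy data: four questions, one model, two administrations.
answer_key = ["A", "B", "True", "C"]
run1 = ["A", "B", "False", None]   # None marks "did not respond"
run2 = ["A", "C", "False", "C"]

print("run 1 accuracy:", accuracy(run1, answer_key))
print("run 2 accuracy:", accuracy(run2, answer_key))
print("questions with changed responses:", changed_on_repeat(run1, run2))
```

Keeping the per-run accuracy separate from the changed-response check mirrors the two findings in the abstract: a model can score the same on both runs while still answering individual questions differently.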


Source journal

Annals of Laboratory Medicine (MEDICAL LABORATORY TECHNOLOGY)

CiteScore: 8.30

Self-citation rate: 12.20%

Articles published: 100

Review time: 6-12 weeks

About the journal: Annals of Laboratory Medicine is the official journal of the Korean Society for Laboratory Medicine. The journal's title was changed from the Korean Journal of Laboratory Medicine (ISSN 1598-6535) starting with the January 2012 issue. The JCR 2017 impact factor of Ann Lab Med was 1.916.