Evaluation of the Performance of Three Large Language Models in Clinical Decision Support: A Comparative Study Based on Actual Cases.

IF 3.5 | Zone 3 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES

Journal of Medical Systems Pub Date : 2025-02-14 DOI:10.1007/s10916-025-02152-9

Xueqi Wang, Haiyan Ye, Sumian Zhang, Mei Yang, Xuebin Wang

Journal of Medical Systems, volume 49(1), page 23. Publication type: Journal Article. Open access: no. Source platform: Semantic Scholar.

Citations: 0

Abstract


Evaluation of the Performance of Three Large Language Models in Clinical Decision Support: A Comparative Study Based on Actual Cases.

Background: Generative large language models (LLMs) are increasingly integrated into the medical field. However, their actual efficacy in clinical decision-making remains partially unexplored. This study aimed to assess the performance of the three LLMs, ChatGPT-4, Gemini, and Med-Go, in the domain of professional medicine when confronted with actual clinical cases.

Methods: This study involved 134 clinical cases spanning nine medical disciplines. Each LLM was required to provide suggestions for diagnosis, diagnostic criteria, differential diagnosis, examination and treatment for every case. Responses were scored by two experts using a predefined rubric.

Results: In overall performance, Med-Go achieved the highest median score (37.5, IQR 31.9-41.5), while Gemini recorded the lowest (33.0, IQR 25.5-36.6), a statistically significant difference among the three LLMs (p < 0.001). Responses related to differential diagnosis were the weakest, while those pertaining to treatment recommendations were the strongest. Med-Go displayed notable performance advantages in gastroenterology, nephrology, and neurology.

Conclusions: The findings show that all three LLMs achieved over 60% of the maximum possible score, indicating their potential applicability in clinical practice. However, inaccuracies that could lead to adverse decisions underscore the need for caution in their application. Med-Go's superior performance highlights the benefits of incorporating specialized medical knowledge into LLM training. It is anticipated that further development and refinement of medical LLMs will enhance their precision and safety in clinical use.
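
The abstract reports each model's median score with its interquartile range (IQR), and compares scores against the maximum possible score. As a rough illustration of how such summaries can be computed from per-case rubric totals, here is a minimal Python sketch. The scores, the `summarize` helper, and the maximum of 50 are all assumptions made for illustration; the abstract does not give the raw data, the rubric maximum, or the quartile method the authors used.

```python
from statistics import median, quantiles


def summarize(scores, max_score):
    """Return the median, the IQR bounds, and the median as a percentage of max_score."""
    # "inclusive" treats the data as the full population of scored cases;
    # this is an assumed choice, not one stated in the abstract.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    med = median(scores)
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "pct_of_max": 100 * med / max_score,
    }


# Synthetic per-case totals for one hypothetical model. NOT data from the study.
example_scores = [28, 31, 33, 35, 36, 38, 40, 41, 44]
HYPOTHETICAL_MAX = 50  # assumed rubric maximum; not stated in the abstract

summary = summarize(example_scores, HYPOTHETICAL_MAX)
print(summary)
```

With real data, the same helper would be applied once per model, and the three score distributions would then be compared with whichever significance test the full paper specifies.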


Source journal

Journal of Medical Systems (Medicine: Health Care)

CiteScore: 11.60

Self-citation rate: 1.90%

Articles published: not listed

Review time: 4.8 months

About the journal: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.