Title: Application of Large Language Models in Complex Clinical Cases: Cross-Sectional Evaluation Study
Authors: Yuanheng Huang, Guozhen Yang, Yahui Shen, Huiguo Chen, Weibin Wu, Xiaojun Li, Yonghui Wu, Kai Zhang, Jiannan Xu, Jian Zhang
Journal: JMIR Medical Informatics, volume 13, e73941 (Journal Article; impact factor 3.8; JCR Q2, Medical Informatics)
Publication date: 2025-08-14
DOI: 10.2196/73941 (https://doi.org/10.2196/73941)
Citations: 0
Abstract
Background: Large language models (LLMs) have made significant advancements in natural language processing (NLP) and are gradually showing potential for application in the medical field. However, LLMs still face challenges in medicine.
Objective: This study aims to evaluate the efficiency, accuracy, and cost of LLMs in handling complex medical cases and to assess their potential and applicability as tools for clinical decision support.
Methods: We selected cases from the database of the Department of Cardiothoracic Surgery, the Third Affiliated Hospital of Sun Yat-sen University (2021-2024), and conducted a multidimensional preliminary evaluation of the latest LLMs in clinical decision-making for complex cases. The evaluation measured the time the LLMs took to generate decision recommendations, Likert scores for those recommendations, and decision costs, in order to assess the execution efficiency, accuracy, and cost-effectiveness of the models.
Results: A total of 80 complex cases were included in this study, and the performance of multiple LLMs in clinical decision-making was evaluated. Experts required 33.60 minutes on average (95% CI 32.57-34.63), far longer than any LLM. GPTo1 (0.71 minutes, 95% CI 0.67-0.74), GPT4o (0.88 minutes, 95% CI 0.83-0.92), and Deepseek (0.94 minutes, 95% CI 0.90-0.96) all finished in under a minute, with no statistically significant differences among them. Kimi, Gemini, LLaMa3-8B, and LLaMa3-70B took 1.02-3.20 minutes but were still faster than the experts. In terms of decision accuracy, Deepseek-R1 scored highest (mean Likert score=4.19), with no significant difference from GPTo1 (P=.699); both performed significantly better than GPT4o, Kimi, Gemini, LLaMa3-70B, and LLaMa3-8B (P<.001). Deepseek-R1 and GPTo1 had the lowest hallucination rates, 6/80 (8%) and 5/80 (6%), respectively, significantly outperforming GPT4o (7/80, 9%), Kimi (10/80, 12%), and the Gemini and LLaMa3 models, whose rates were substantially higher, ranging from 13/80 (16%) to 25/80 (31%). Regarding decision costs, all LLMs cost significantly less than the multidisciplinary team, and open-source models such as Deepseek-R1 offered the advantage of zero direct cost.
Conclusions: GPTo1 and Deepseek-R1 show strong clinical potential, boosting efficiency, maintaining accuracy, and reducing costs. GPT4o and Kimi performed moderately, indicating suitability for broader clinical tasks. Further research is needed to validate the LLaMa3 series and Gemini in clinical decision-making.
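The percentages and intervals quoted in the Results follow from simple arithmetic. The Python sketch below is illustrative only and is not the authors' analysis code. The hallucination counts are those reported in the abstract; the helper names `rate_percent` and `mean_ci95` are hypothetical, and the normal-approximation interval is an assumption, since the abstract does not say how its 95% CIs were computed.

```python
from math import sqrt
from statistics import mean, stdev

# Hallucination counts out of 80 cases, as reported in the Results.
# Gemini and the LLaMa3 models are omitted because the abstract gives
# only their overall range (13/80 to 25/80), not per-model counts.
TOTAL_CASES = 80
hallucination_counts = {
    "Deepseek-R1": 6,
    "GPTo1": 5,
    "GPT4o": 7,
    "Kimi": 10,
}


def rate_percent(count, total=TOTAL_CASES):
    """Return count/total as a percentage (hypothetical helper)."""
    return 100.0 * count / total


def mean_ci95(samples):
    """Mean with a normal-approximation 95% CI.

    Assumption: the abstract does not state which CI method was used;
    this is one common choice, not necessarily the authors'.
    """
    m = mean(samples)
    half_width = 1.96 * stdev(samples) / sqrt(len(samples))
    return m, m - half_width, m + half_width


# Unrounded rates, e.g. 6/80 -> 7.5 and 10/80 -> 12.5.
rates = {model: rate_percent(c) for model, c in hallucination_counts.items()}

# Python's round() rounds halves to the nearest even integer, which
# matches the whole-number percentages printed in the abstract.
reported = {model: round(r) for model, r in rates.items()}

for model in hallucination_counts:
    print(f"{model}: {rates[model]:.2f}% (reported as {reported[model]}%)")
```

Applying `mean_ci95` to the per-case response times of a model, which the abstract does not provide, would give an interval of the same form as those quoted for the timing results.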
About the journal:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, industry, and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope: it places more emphasis on applications for clinicians and health professionals than on consumers and citizens (the focus of JMIR), publishes even faster, and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.