A Comparative Study of Five Large Language Models' Response for Liver Cancer Comprehensive Treatment

Deyuan Zhong, Yuxin Liang, Hong-Tao Yan, Xinpei Chen, Qinyan Yang, Shuoshuo Ma, Yuhao Su, YaHui Chen, Xiaolun Huang, Ming Wang

Journal of Hepatocellular Carcinoma, vol. 12, pp. 1861-1871 (2025). DOI: 10.2147/JHC.S531642
Abstract
Introduction: Large language models (LLMs) are increasingly used in healthcare, yet their reliability in specialized clinical fields remains uncertain. Liver cancer, a complex and high-burden disease, poses unique challenges for AI-based tools. This study aimed to evaluate the comprehensibility and clinical applicability of five mainstream LLMs in addressing liver cancer-related clinical questions.
Methods: We developed 90 standardized questions covering multiple aspects of liver cancer management. Five LLMs (GPT-4, Gemini, Copilot, Kimi, and Ernie Bot) were evaluated in a blinded fashion by three independent hepatobiliary experts. Responses were scored using predefined criteria for comprehensibility and clinical applicability. Overall group comparisons were conducted using the Fisher-Freeman-Halton test (for categorical data) and the Kruskal-Wallis test (for ordinal scores), followed by Dunn's post-hoc test or Fisher's exact test with Bonferroni correction. Inter-rater reliability was assessed using Fleiss' kappa.
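To make the statistical workflow concrete, the sketch below shows how the ordinal-score comparison and the inter-rater agreement described above could be run in Python. This is an illustrative reconstruction, not the authors' code: the column names "model" and "score", the data layout, and the library choices (scipy, scikit-posthocs, statsmodels) are all assumptions. The Fisher-Freeman-Halton extension of Fisher's exact test for r x c categorical tables is omitted because it is not available in older SciPy releases (R's fisher.test provides it).

```python
# Hedged sketch of the abstract's analysis pipeline. Assumes a long-format
# pandas DataFrame with hypothetical columns "model" (LLM name) and "score"
# (ordinal expert rating), and a ratings array for the three experts.
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def compare_models(df: pd.DataFrame):
    """Kruskal-Wallis across models, then Dunn's post-hoc with Bonferroni."""
    groups = [g["score"].to_numpy() for _, g in df.groupby("model")]
    h_stat, p_value = kruskal(*groups)
    # Pairwise Dunn comparisons with Bonferroni correction, as in the Methods.
    dunn_pvals = sp.posthoc_dunn(
        df, val_col="score", group_col="model", p_adjust="bonferroni"
    )
    return h_stat, p_value, dunn_pvals


def rater_agreement(ratings):
    """Fleiss' kappa for the three experts' ratings.

    `ratings` is an (n_items, n_raters) array of category labels; it is first
    aggregated into an (n_items, n_categories) count table.
    """
    table, _ = aggregate_raters(ratings)
    return fleiss_kappa(table, method="fleiss")
```

With 90 questions and five models, `compare_models` would receive a 450-row frame per criterion; a significant Kruskal-Wallis result would then justify reading the Bonferroni-adjusted pairwise p-values from the Dunn matrix.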
Results: Kimi and GPT-4 achieved the highest proportions of fully applicable responses (68% and 62%, respectively), while Ernie Bot and Copilot showed the lowest. Comprehensibility was generally high, with Kimi and Ernie Bot scoring over 98%. However, none of the LLMs consistently provided guideline-concordant answers to all questions. Performance on professional-level questions was significantly lower than on common-sense ones, highlighting deficiencies in complex clinical reasoning.
Conclusion: LLMs demonstrate varied performance on liver cancer-related queries. While GPT-4 and Kimi show promise in clinical applicability, limitations in accuracy and consistency, particularly for complex medical decisions, underscore the need for domain-specific optimization before clinical integration.