Jun Xu, Junjie Wang, Junjun Li, Zhangxiang Zhu, Xiao Fu, Wei Cai, Ruipeng Song, Tengfei Wang, Hai Li
Predicting Immunotherapy Response in Unresectable Hepatocellular Carcinoma: A Comparative Study of Large Language Models and Human Experts
Journal of Medical Systems, volume 49, issue 1, article 64. Published 2025-05-15. DOI: 10.1007/s10916-025-02192-1 (https://doi.org/10.1007/s10916-025-02192-1)
Publication type: Journal Article. Impact factor: 3.5. JCR: Q1 (Health Care Sciences & Services).
Citations: 0
Abstract
Hepatocellular carcinoma (HCC) is an aggressive cancer with few biomarkers for predicting immunotherapy response. Recent advances in large language models (LLMs) such as GPT-4, GPT-4o, and Gemini offer the potential to enhance clinical decision-making through multimodal data analysis. However, their effectiveness in predicting immunotherapy response, especially compared with human experts, remains unclear. This study assessed the performance of GPT-4, GPT-4o, and Gemini in predicting immunotherapy response in unresectable HCC, compared with radiologists and oncologists of varying expertise. A retrospective analysis of 186 patients with unresectable HCC used multimodal data (clinical data and CT images). LLMs were evaluated with zero-shot prompting and two strategies intended to improve sensitivity: the 'voting method' and the 'OR rule method'. Performance metrics included accuracy, sensitivity, area under the curve (AUC), and agreement across LLMs and physicians. GPT-4o, using the 'OR rule method', achieved 65% accuracy and 47% sensitivity, comparable to intermediate physicians but lower than senior physicians (accuracy: 72%, p = 0.045; sensitivity: 70%, p < 0.0001). Gemini-GPT, which combines GPT-4, GPT-4o, and Gemini, achieved an AUC of 0.69, similar to that of senior physicians (AUC: 0.72, p = 0.35), with 68% accuracy, outperforming junior and intermediate physicians while remaining comparable to senior physicians (p = 0.78). However, its sensitivity (58%) was lower than that of senior physicians (p = 0.0097). LLMs demonstrated higher inter-model agreement (κ = 0.59–0.70) than inter-physician agreement, especially among junior physicians (κ = 0.15). This study highlights the potential of LLMs, particularly Gemini-GPT, as valuable tools for predicting immunotherapy response in HCC.
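The abstract names a 'voting method' and an 'OR rule method' without defining them. The following is a minimal Python sketch of one plausible reading, not the authors' implementation: it assumes each patient has several binary predictions (1 = responder, 0 = non-responder), that voting means a strict majority, and that the OR rule flags a responder if any single prediction is positive. The function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def majority_vote(predictions: Sequence[int]) -> int:
    """Assumed 'voting method': 1 if a strict majority of predictions are 1."""
    return int(2 * sum(predictions) > len(predictions))


def or_rule(predictions: Sequence[int]) -> int:
    """Assumed 'OR rule method': 1 if at least one prediction is 1."""
    return int(any(predictions))


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of patients whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true responders that were predicted as responders."""
    preds_for_responders = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_responders:
        raise ValueError("sensitivity is undefined without true responders")
    return sum(preds_for_responders) / len(preds_for_responders)


# Toy data only (not from the study): three predictions per patient.
per_patient = [(1, 0, 0), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1), (0, 0, 1)]
y_true = [1, 1, 0, 0, 1, 0]

vote_preds = [majority_vote(p) for p in per_patient]
or_preds = [or_rule(p) for p in per_patient]

print("voting:", accuracy(y_true, vote_preds), sensitivity(y_true, vote_preds))
print("OR rule:", accuracy(y_true, or_preds), sensitivity(y_true, or_preds))
```

Under these assumptions the OR rule can only add positive calls relative to majority voting, so its sensitivity is never lower, while its accuracy may fall if the extra positives are false alarms. The toy data shows that trade-off.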
About the journal:
Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.