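Application of Large Language Models in TN Staging and Treatment Response Evaluation for Patients With Nasopharyngeal Carcinoma: A Comparative Performance Analysis of ChatGPT-4o-Latest and DeepSeek-V3-0324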

IF 3.5 | Zone 2, Medicine | JCR Q1: RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Journal of Magnetic Resonance Imaging. Pub Date: 2025-10-04. DOI: 10.1002/jmri.70140

Yifan Yang, Fan Yang, Shengsheng Xiao, Kaiqi Hou, Kexin Chen, Zaiyi Liu, Changhong Liang, Xiaobo Chen, Guangyi Wang


Citations: 0



Application of Large Language Models in TN Staging and Treatment Response Evaluation for Patients With Nasopharyngeal Carcinoma: A Comparative Performance Analysis of ChatGPT-4o-Latest and DeepSeek-V3-0324.

Background: Accurate tumor staging and treatment response evaluation (TRE) are critical for nasopharyngeal carcinoma (NPC) clinical decisions. Conventional methods relying on manual imaging analysis are expertise-dependent, time-consuming, and prone to inter-observer variability and errors.

Purpose: To assess the performance of two large language models (LLMs), ChatGPT-4o-latest and DeepSeek-V3-0324, in automating T and N staging and TRE for NPC patients.

Study type: Retrospective.

Population: Three hundred seven NPC patients from three centers (mean age: 45.5 ± 11.3 years; 216 men, 91 women).

Field strength/sequence: All imaging was conducted using 3.0T or 1.5T scanners. The imaging sequences included axial T1-weighted fast spin-echo, T2-weighted fast spin-echo, T2-weighted fat-suppressed spin-echo, and contrast-enhanced T1-weighted fast spin-echo.

Assessment: Two radiologists established the reference standards for TN staging at baseline and for TRE at two time points, post-induction chemotherapy (TRE-1) and post-concurrent chemoradiotherapy (TRE-2), based on the 9th version of the AJCC/UICC guidelines and the RECIST 1.1 criteria. The LLMs were prompted using few-shot chain-of-thought prompting and tested on 277 patients with 831 reports. Additionally, four radiologists independently assessed 68 cases both with and without LLM assistance, and their performance and efficiency were compared between the two conditions.
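As an illustration of what few-shot chain-of-thought prompting can look like, the sketch below assembles a prompt from worked examples followed by a new report. It is not the authors' prompt: the `Example` class, the `build_prompt` function, the prompt wording, and the placeholder strings are all hypothetical, and the abstract does not describe the actual template.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One worked example shown to the model (hypothetical structure)."""
    report: str      # report text
    reasoning: str   # step-by-step rationale leading to the answer
    label: str       # reference answer, e.g. a T category


def build_prompt(task: str, examples: list[Example], new_report: str) -> str:
    """Concatenate the task, the worked examples, and the new case."""
    parts = [
        f"Task: {task}",
        "Reason step by step, then give the final answer.",
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Report: {ex.report}\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Answer: {ex.label}"
        )
    # The prompt ends with an open "Reasoning:" slot for the model to fill.
    parts.append(f"New case\nReport: {new_report}\nReasoning:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [Example("<example report>", "<example rationale>", "<example label>")]
    print(build_prompt("T staging", shots, "<new report>"))
```

Ending the prompt at an open reasoning slot is one common way to elicit a rationale before the final label; other layouts would serve equally well.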

Statistical tests: McNemar-Bowker test, Wilcoxon signed-rank test. p < 0.05 was considered statistically significant.
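For readers unfamiliar with the McNemar-Bowker test, the following is a minimal sketch of Bowker's test of symmetry on a square table of paired categorical outcomes. The abstract does not say how the tables were constructed or which software was used, so the function names (`chi2_sf`, `bowker_test`) and the example table are assumptions made for illustration only. The chi-square tail probability is computed with the standard library so the block has no external dependencies.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(stat: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1),
    starting from Q(1/2, x) = erfc(sqrt(x)) or Q(1, x) = exp(-x).
    """
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-x)
    else:
        a, q = 0.5, erfc(sqrt(x))
    while a < df / 2:
        q += x ** a * exp(-x) / gamma(a + 1)
        a += 1.0
    return q


def bowker_test(table: list[list[int]]) -> tuple[float, int, float]:
    """Return (statistic, degrees of freedom, p-value) for a k x k table.

    table[i][j] counts pairs rated category i by method A and j by method B.
    Off-diagonal pairs with no observations are skipped.
    """
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total = table[i][j] + table[j][i]
            if total > 0:
                stat += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    if df == 0:
        return 0.0, 0, 1.0  # no discordant pairs: nothing to test
    return stat, df, chi2_sf(stat, df)


if __name__ == "__main__":
    # Hypothetical 2 x 2 table (reduces to the classic McNemar test).
    print(bowker_test([[10, 5], [1, 10]]))
```

For a 2 x 2 table this reduces to McNemar's test without continuity correction.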

Results: DeepSeek-V3-0324 significantly outperformed ChatGPT-4o-latest in TRE-1 (96.5% vs. 82.9%, p < 0.001). For T staging (95.3% vs. 93.5%, p = 0.24), N staging (93.8% vs. 89.6%, p = 0.265), and TRE-2 (94.9% vs. 93.2%, p = 0.556), accuracy did not differ significantly between DeepSeek-V3-0324 and ChatGPT-4o-latest. DeepSeek-V3-0324 also showed stronger agreement with expert annotation (κ = 0.85-0.90) than ChatGPT-4o-latest (κ = 0.49-0.86). Significant improvements in time efficiency were observed across all radiologists with LLM assistance (p < 0.001).
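The κ values above measure chance-corrected agreement between model output and the expert reference. The abstract does not state whether a weighted or unweighted kappa was used, so the sketch below assumes the unweighted Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa` and the labels in the usage lines are hypothetical.

```python
from collections import Counter


def cohen_kappa(reference: list[str], predicted: list[str]) -> float:
    """Unweighted Cohen's kappa between two label sequences of equal length."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    # Observed agreement: fraction of cases where both labels match.
    p_o = sum(r == p for r, p in zip(reference, predicted)) / n
    # Chance agreement: product of the marginal label frequencies.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / n ** 2
    if p_e == 1.0:
        # Both sequences use a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    ref = ["T1", "T1", "T2", "T2"]    # hypothetical reference labels
    pred = ["T1", "T2", "T2", "T2"]   # hypothetical model labels
    print(cohen_kappa(ref, pred))
```

In the usage lines, three of four labels match (p_o = 0.75) and chance agreement is 0.5, so κ is 0.5 despite the 75% raw accuracy, which is why accuracy and κ are reported separately.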

Data conclusion: LLMs, particularly DeepSeek-V3-0324, can automate NPC TN staging and TRE with high accuracy, enhancing clinical efficiency. LLM integration may improve diagnostic consistency, especially for junior clinicians.

Level of evidence: 3.

Technical efficacy: Stage 4.


Source journal: Journal of Magnetic Resonance Imaging (Medicine: Nuclear Medicine)

CiteScore: 9.70
Self-citation rate: 6.80%
Articles published: 494
Review time: 2 months

Journal description: The Journal of Magnetic Resonance Imaging (JMRI) is an international journal devoted to the timely publication of basic and clinical research, educational and review articles, and other information related to the diagnostic applications of magnetic resonance.