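Application of Large Language Models in TN Staging and Treatment Response Evaluation for Patients With Nasopharyngeal Carcinoma: A Comparative Performance Analysis of ChatGPT-4o-Latest and DeepSeek-V3-0324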

IF 3.5 | Zone 2 (Medicine) | JCR Q1 | Radiology, Nuclear Medicine & Medical Imaging
Yifan Yang, Fan Yang, Shengsheng Xiao, Kaiqi Hou, Kexin Chen, Zaiyi Liu, Changhong Liang, Xiaobo Chen, Guangyi Wang
Journal of Magnetic Resonance Imaging. Journal Article. Published 2025-10-04. DOI: 10.1002/jmri.70140 (https://doi.org/10.1002/jmri.70140)
Citations: 0

Abstract

Application of Large Language Models in TN Staging and Treatment Response Evaluation for Patients With Nasopharyngeal Carcinoma: A Comparative Performance Analysis of ChatGPT-4o-Latest and DeepSeek-V3-0324.

Background: Accurate tumor staging and treatment response evaluation (TRE) are critical for nasopharyngeal carcinoma (NPC) clinical decisions. Conventional methods relying on manual imaging analysis are expertise-dependent, time-consuming, and prone to inter-observer variability and errors.

Purpose: To assess the performance of two large language models (LLMs), ChatGPT-4o-latest and DeepSeek-V3-0324, in automating T and N staging and TRE for NPC patients.

Study type: Retrospective.

Population: Three hundred seven NPC patients from three centers (mean age: 45.5 ± 11.3 years; 216 men, 91 women).

Field strength/sequence: All imaging was conducted using 3.0T or 1.5T scanners. The imaging sequences included axial T1-weighted fast spin-echo, T2-weighted fast spin-echo, T2-weighted fat-suppressed spin-echo, and contrast-enhanced T1-weighted fast spin-echo.

Assessment: Two radiologists established the reference standards for TN staging at baseline and for TRE at two time points: post-induction chemotherapy (TRE-1) and post-concurrent chemoradiotherapy (TRE-2), based on the 9th version of the AJCC/UICC guidelines and the RECIST 1.1 criteria. The LLMs were prompted with few-shot chain-of-thought examples and tested on 277 patients with 831 reports. Additionally, four radiologists independently assessed 68 cases both with and without LLM assistance, and their performance and efficiency were compared between the two conditions.

Statistical tests: McNemar-Bowker test, Wilcoxon signed-rank test. p < 0.05 was considered statistically significant.

Results: DeepSeek-V3-0324 significantly outperformed ChatGPT-4o-latest in TRE-1 (96.5% vs. 82.9%, p < 0.001). Accuracy did not differ significantly between DeepSeek-V3-0324 and ChatGPT-4o-latest for T staging (95.3% vs. 93.5%, p = 0.24), N staging (93.8% vs. 89.6%, p = 0.265), or TRE-2 (94.9% vs. 93.2%, p = 0.556). DeepSeek-V3-0324 also showed stronger agreement with expert annotation (κ = 0.85-0.90) than ChatGPT-4o-latest (κ = 0.49-0.86). With LLM assistance, time efficiency improved significantly for all radiologists (p < 0.001).

Data conclusion: LLMs, particularly DeepSeek-V3-0324, can automate NPC TN staging and TRE with high accuracy, enhancing clinical efficiency. LLM integration may improve diagnostic consistency, especially for junior clinicians.

Level of evidence: 3.

Technical efficacy: Stage 4.
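
The metrics named in the abstract (accuracy, Cohen's κ against a reference standard, and a paired comparison of two models) can be illustrated with a minimal Python sketch. This is not the authors' code: the labels below are made-up toy data, the function names are hypothetical, and an exact binomial McNemar test on correct/incorrect pairs is used as a simplified stand-in for the McNemar-Bowker test reported in the study.

```python
from math import comb


def accuracy(pred, ref):
    """Fraction of cases where the predicted label equals the reference label."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two label sequences of equal length."""
    n = len(a)
    labels = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_expected == 1:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def mcnemar_exact(pred_a, pred_b, ref):
    """Two-sided exact McNemar p-value on paired correct/incorrect outcomes.

    Simplification (assumption): only the discordant pairs are used, i.e. cases
    where exactly one of the two models matches the reference.
    """
    only_a = sum(pa == r and pb != r for pa, pb, r in zip(pred_a, pred_b, ref))
    only_b = sum(pa != r and pb == r for pa, pb, r in zip(pred_a, pred_b, ref))
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


# Toy T-stage labels (hypothetical, NOT data from the study).
reference = ["T1", "T2", "T3", "T3", "T4", "T2", "T1", "T3"]
model_a = ["T1", "T2", "T3", "T3", "T4", "T2", "T1", "T2"]
model_b = ["T1", "T2", "T2", "T3", "T3", "T2", "T2", "T2"]

print("accuracy A:", accuracy(model_a, reference))
print("accuracy B:", accuracy(model_b, reference))
print("kappa A vs reference:", round(cohen_kappa(model_a, reference), 3))
print("McNemar exact p:", mcnemar_exact(model_a, model_b, reference))
```

With only eight toy cases the paired test has almost no power; the sketch is meant to show how the quantities relate, not to reproduce the reported values.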

Source journal: Journal of Magnetic Resonance Imaging (JMRI)
CiteScore: 9.70
Self-citation rate: 6.80%
Articles published: 494
Review time: 2 months
About the journal: The Journal of Magnetic Resonance Imaging (JMRI) is an international journal devoted to the timely publication of basic and clinical research, educational and review articles, and other information related to the diagnostic applications of magnetic resonance.