Yifan Yang, Fan Yang, Shengsheng Xiao, Kaiqi Hou, Kexin Chen, Zaiyi Liu, Changhong Liang, Xiaobo Chen, Guangyi Wang
Journal of Magnetic Resonance Imaging. Published 2025-10-04. DOI: 10.1002/jmri.70140 (https://doi.org/10.1002/jmri.70140). Level of evidence: 3. Technical efficacy: Stage 4.
Application of Large Language Models in TN Staging and Treatment Response Evaluation for Patients With Nasopharyngeal Carcinoma: A Comparative Performance Analysis of ChatGPT-4o-Latest and DeepSeek-V3-0324.
Background: Accurate tumor staging and treatment response evaluation (TRE) are critical for nasopharyngeal carcinoma (NPC) clinical decisions. Conventional methods relying on manual imaging analysis are expertise-dependent, time-consuming, and prone to inter-observer variability and errors.
Purpose: To assess the performance of two large language models (LLMs), ChatGPT-4o-latest and DeepSeek-V3-0324, in automating T and N staging and TRE for NPC patients.
Study type: Retrospective.
Population: Three hundred seven NPC patients from three centers (mean age: 45.5 ± 11.3 years; 216 men, 91 women).
Field strength/sequence: All imaging was performed on 3.0T or 1.5T scanners. The imaging sequences included axial T1-weighted fast spin-echo, T2-weighted fast spin-echo, T2-weighted fat-suppressed spin-echo, and contrast-enhanced T1-weighted fast spin-echo.
Assessment: Two radiologists established the reference standards for TN staging at baseline and for TRE at two time points: post-induction chemotherapy (TRE-1) and post-concurrent chemoradiotherapy (TRE-2), based on the 9th version of the AJCC/UICC guidelines and the RECIST 1.1 criteria. The LLMs were applied via few-shot chain-of-thought prompting and tested on 277 patients with 831 reports. Additionally, four radiologists independently assessed 68 cases both with and without LLM assistance, and their performance and efficiency were compared between the two conditions.
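As an illustration of the RECIST 1.1 criteria referenced above, the following is a minimal Python sketch of the target-lesion response categories (CR, PR, SD, PD). It is not the authors' pipeline: the function name, argument names, and the simplification of complete response to a zero diameter sum (ignoring the nodal short-axis rule and non-target lesions) are assumptions made for illustration.

```python
def recist_response(baseline_sum_mm, nadir_sum_mm, current_sum_mm, new_lesion=False):
    """Classify target-lesion response under a simplified reading of RECIST 1.1.

    baseline_sum_mm: sum of target-lesion diameters at baseline
    nadir_sum_mm:    smallest sum recorded so far (including baseline)
    current_sum_mm:  sum at the current evaluation
    new_lesion:      True if any new lesion has appeared
    """
    # Progressive disease: any new lesion, or an increase of at least 20%
    # over the nadir that is also at least 5 mm in absolute terms.
    if new_lesion:
        return "PD"
    increase = current_sum_mm - nadir_sum_mm
    if increase >= 0.20 * nadir_sum_mm and increase >= 5.0:
        return "PD"

    # Complete response (simplified): all target lesions have disappeared.
    if current_sum_mm == 0:
        return "CR"

    # Partial response: at least a 30% decrease relative to baseline.
    if current_sum_mm <= 0.70 * baseline_sum_mm:
        return "PR"

    # Otherwise neither sufficient shrinkage nor sufficient growth.
    return "SD"


if __name__ == "__main__":
    # Hypothetical measurements, not data from the study.
    print(recist_response(100, 100, 60))
    print(recist_response(100, 50, 62))
```

Progression is checked first because it is measured against the nadir rather than the baseline, so a lesion can still be more than 30% below baseline while qualifying as progressive disease.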
Statistical tests: McNemar-Bowker test, Wilcoxon signed-rank test. A p value < 0.05 was considered statistically significant.
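To show how a paired comparison of two models on the same cases works, here is a minimal Python sketch of the exact McNemar test, the binary (correct versus incorrect) special case of the McNemar-Bowker test named above. Whether the authors applied the test in this form is not stated in the abstract; the function names and the example labels are hypothetical.

```python
from math import comb


def discordant_counts(reference, model_a, model_b):
    """Count cases where exactly one of the two models matches the reference."""
    a_only = b_only = 0
    for ref, a, b in zip(reference, model_a, model_b):
        a_ok, b_ok = a == ref, b == ref
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    return a_only, b_only


def mcnemar_exact_p(a_only, b_only):
    """Two-sided exact McNemar p value from the two discordant counts.

    Under the null hypothesis, each discordant case is equally likely to
    favor either model, so the smaller count follows Binomial(n, 0.5).
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical labels, not data from the study.
    reference = ["PR", "SD", "PD", "CR"]
    model_a = ["PR", "SD", "PD", "PR"]
    model_b = ["PR", "PD", "SD", "PR"]
    a_only, b_only = discordant_counts(reference, model_a, model_b)
    print(a_only, b_only, mcnemar_exact_p(a_only, b_only))
```

Only the discordant cases enter the test; cases where both models are right or both are wrong carry no information about which model is better.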
Results: DeepSeek-V3-0324 significantly outperformed ChatGPT-4o-latest in TRE-1 (96.5% vs. 82.9%, p < 0.001). For T staging (95.3% vs. 93.5%, p = 0.24), N staging (93.8% vs. 89.6%, p = 0.265), and TRE-2 (94.9% vs. 93.2%, p = 0.556), accuracy did not differ significantly between DeepSeek-V3-0324 and ChatGPT-4o-latest. DeepSeek-V3-0324 also showed stronger agreement with expert annotation (κ = 0.85-0.90) than ChatGPT-4o-latest (κ = 0.49-0.86). Time efficiency improved significantly for all radiologists with LLM assistance (p < 0.001).
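The κ values above measure agreement beyond chance between model output and expert annotation. The abstract does not say which variant was used; the sketch below assumes unweighted Cohen's kappa, and the function name and labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of cases with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if both raters labeled independently with
    # their own marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / n ** 2

    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical T-stage labels, not data from the study.
    expert = ["T1", "T2", "T3", "T4"]
    model = ["T1", "T2", "T3", "T4"]
    print(cohen_kappa(expert, model))
```

Kappa can be much lower than raw accuracy when one category dominates, which is one reason a model's accuracy and its κ range can diverge.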
Data conclusion: LLMs, particularly DeepSeek-V3-0324, can automate NPC TN staging and TRE with high accuracy, enhancing clinical efficiency. LLM integration may improve diagnostic consistency, especially for junior clinicians.
About the journal:
The Journal of Magnetic Resonance Imaging (JMRI) is an international journal devoted to the timely publication of basic and clinical research, educational and review articles, and other information related to the diagnostic applications of magnetic resonance.