Peng-Wei Luo, Ji-Wen Liu, Xi Xie, Jia-Wei Jiang, Xin-Yu Huo, Zhen-Lin Chen, Zhang-Cheng Huang, Shao-Qin Jiang, Meng-Qiang Li
{"title":"DeepSeek与ChatGPT:用多种语言回答前列腺癌放疗问题的性能比较研究。","authors":"Peng-Wei Luo, Ji-Wen Liu, Xi Xie, Jia-Wei Jiang, Xin-Yu Huo, Zhen-Lin Chen, Zhang-Cheng Huang, Shao-Qin Jiang, Meng-Qiang Li","doi":"10.62347/UIAP7979","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>The medical information generated by large language models (LLM) is crucial for improving patient education and clinical decision-making. This study aims to evaluate the performance of two LLMs (DeepSeek and ChatGPT) in answering questions related to prostate cancer radiotherapy in both Chinese and English environments. Through a comparative analysis, we aim to determine which model can provide higher-quality answers in different language environments.</p><p><strong>Methods: </strong>A structured evaluation framework was developed using a set of clinically relevant questions covering three key domains: foundational knowledge, patient education, and treatment and follow-up care. Responses from DeepSeek and ChatGPT were generated in both English and Chinese and independently assessed by a panel of five oncology specialists using a five-point Likert scale. Statistical analyses, including the Wilcoxon signed-rank test, were performed to compare the models' performance across different linguistic contexts.</p><p><strong>Results: </strong>This study ultimately included 33 questions for scoring. In Chinese, DeepSeek outperformed ChatGPT, achieving top ratings (score = 5) in 75.76% vs. 36.36% of responses (P < 0.001), excelling in foundational knowledge (76.92% vs. 38.46%, <i>P</i> = 0.047) and treatment/follow-up (81.82% vs. 36.36%, <i>P</i> = 0.031). In English, ChatGPT showed comparable performance (66.7% vs. 54.55% top-rated responses, <i>P</i> = 0.236), with marginal advantages in treatment/follow-up (63.64% vs. 54.55%, <i>P</i> = 0.563). DeepSeek maintained strengths in English foundational knowledge (69.23% vs. 30.77%, <i>P</i> = 0.047) and patient education (88.89% vs. 55.56%, <i>P</i> = 0.125). These findings underscore DeepSeek's superior Chinese proficiency and language-specific optimization impacts.</p><p><strong>Conclusions: </strong>This study shows that DeepSeek performs excellently in providing Chinese medical information, while the two models perform similarly in an English environment. These findings underscore the importance of selecting language-specific artificial intelligence (AI) models to enhance the accuracy and reliability of medical AI applications. 
While both models show promise in supporting patient education and clinical decision-making, human expert review remains necessary to ensure response accuracy and minimize potential misinformation.</p>","PeriodicalId":7438,"journal":{"name":"American journal of clinical and experimental urology","volume":"13 2","pages":"176-185"},"PeriodicalIF":1.5000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12089221/pdf/","citationCount":"0","resultStr":"{\"title\":\"DeepSeek vs ChatGPT: a comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages.\",\"authors\":\"Peng-Wei Luo, Ji-Wen Liu, Xi Xie, Jia-Wei Jiang, Xin-Yu Huo, Zhen-Lin Chen, Zhang-Cheng Huang, Shao-Qin Jiang, Meng-Qiang Li\",\"doi\":\"10.62347/UIAP7979\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>The medical information generated by large language models (LLM) is crucial for improving patient education and clinical decision-making. This study aims to evaluate the performance of two LLMs (DeepSeek and ChatGPT) in answering questions related to prostate cancer radiotherapy in both Chinese and English environments. Through a comparative analysis, we aim to determine which model can provide higher-quality answers in different language environments.</p><p><strong>Methods: </strong>A structured evaluation framework was developed using a set of clinically relevant questions covering three key domains: foundational knowledge, patient education, and treatment and follow-up care. Responses from DeepSeek and ChatGPT were generated in both English and Chinese and independently assessed by a panel of five oncology specialists using a five-point Likert scale. Statistical analyses, including the Wilcoxon signed-rank test, were performed to compare the models' performance across different linguistic contexts.</p><p><strong>Results: </strong>This study ultimately included 33 questions for scoring. In Chinese, DeepSeek outperformed ChatGPT, achieving top ratings (score = 5) in 75.76% vs. 36.36% of responses (P < 0.001), excelling in foundational knowledge (76.92% vs. 38.46%, <i>P</i> = 0.047) and treatment/follow-up (81.82% vs. 36.36%, <i>P</i> = 0.031). In English, ChatGPT showed comparable performance (66.7% vs. 54.55% top-rated responses, <i>P</i> = 0.236), with marginal advantages in treatment/follow-up (63.64% vs. 54.55%, <i>P</i> = 0.563). DeepSeek maintained strengths in English foundational knowledge (69.23% vs. 30.77%, <i>P</i> = 0.047) and patient education (88.89% vs. 55.56%, <i>P</i> = 0.125). These findings underscore DeepSeek's superior Chinese proficiency and language-specific optimization impacts.</p><p><strong>Conclusions: </strong>This study shows that DeepSeek performs excellently in providing Chinese medical information, while the two models perform similarly in an English environment. These findings underscore the importance of selecting language-specific artificial intelligence (AI) models to enhance the accuracy and reliability of medical AI applications. 
While both models show promise in supporting patient education and clinical decision-making, human expert review remains necessary to ensure response accuracy and minimize potential misinformation.</p>\",\"PeriodicalId\":7438,\"journal\":{\"name\":\"American journal of clinical and experimental urology\",\"volume\":\"13 2\",\"pages\":\"176-185\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2025-04-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12089221/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American journal of clinical and experimental urology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.62347/UIAP7979\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q3\",\"JCRName\":\"UROLOGY & NEPHROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American journal of clinical and experimental urology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.62347/UIAP7979","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
DeepSeek vs ChatGPT: a comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages.
Introduction: The medical information generated by large language models (LLMs) is crucial for improving patient education and clinical decision-making. This study aims to evaluate the performance of two LLMs (DeepSeek and ChatGPT) in answering questions related to prostate cancer radiotherapy in both Chinese- and English-language environments. Through a comparative analysis, we aim to determine which model provides higher-quality answers in different language environments.
Methods: A structured evaluation framework was developed using a set of clinically relevant questions covering three key domains: foundational knowledge, patient education, and treatment and follow-up care. Responses from DeepSeek and ChatGPT were generated in both English and Chinese and independently assessed by a panel of five oncology specialists using a five-point Likert scale. Statistical analyses, including the Wilcoxon signed-rank test, were performed to compare the models' performance across different linguistic contexts.
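As a concrete illustration of the paired comparison described above, the sketch below applies the Wilcoxon signed-rank test (scipy.stats.wilcoxon) to per-question ratings from the two models. The scores are hypothetical placeholders, not the study's data, and the sketch is not the authors' analysis code.

```python
# Minimal sketch, assuming one Likert rating per question for each model
# (all values below are hypothetical, not taken from the study).
from scipy.stats import wilcoxon

deepseek_scores = [5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5]  # hypothetical ratings
chatgpt_scores = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4]   # hypothetical ratings

# Paired, non-parametric comparison of the two models on the same questions.
stat, p_value = wilcoxon(deepseek_scores, chatgpt_scores)
print(f"Wilcoxon statistic = {stat}, P = {p_value:.3f}")
```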
Results: This study ultimately included 33 questions for scoring. In Chinese, DeepSeek outperformed ChatGPT, achieving top ratings (score = 5) in 75.76% vs. 36.36% of responses (P < 0.001), excelling in foundational knowledge (76.92% vs. 38.46%, P = 0.047) and treatment/follow-up (81.82% vs. 36.36%, P = 0.031). In English, ChatGPT showed comparable performance (66.7% vs. 54.55% top-rated responses, P = 0.236), with marginal advantages in treatment/follow-up (63.64% vs. 54.55%, P = 0.563). DeepSeek maintained strengths in English foundational knowledge (69.23% vs. 30.77%, P = 0.047) and patient education (88.89% vs. 55.56%, P = 0.125). These findings underscore DeepSeek's superior Chinese-language proficiency and the impact of language-specific optimization.
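As a back-of-the-envelope check (an assumption-based reconstruction, not taken from the paper), the overall Chinese-language percentages correspond to counts out of the 33 scored questions:

```python
# Hypothetical mapping of reported percentages to question counts (n = 33).
total_questions = 33
deepseek_top = round(0.7576 * total_questions)  # 25 responses rated 5
chatgpt_top = round(0.3636 * total_questions)   # 12 responses rated 5
print(deepseek_top, chatgpt_top)                # -> 25 12
```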
Conclusions: This study shows that DeepSeek performs excellently in providing Chinese medical information, while the two models perform similarly in an English environment. These findings underscore the importance of selecting language-specific artificial intelligence (AI) models to enhance the accuracy and reliability of medical AI applications. While both models show promise in supporting patient education and clinical decision-making, human expert review remains necessary to ensure response accuracy and minimize potential misinformation.