Evaluation of DeepSeek-R1 and ChatGPT-4o as educational sources for upper tract urothelial carcinoma
Wojciech Krajewski, Jan Łaszkiewicz, Łukasz Biesiadecki, Wojciech Tomczak, Łukasz Nowak, Piotr Łaszkiewicz, Joanna Chorbińska, Francesco Del Giudice, Benjamin I Chung, Tomasz Szydełko
Central European Journal of Urology 2026;79(1):1-8. DOI: 10.5173/ceju.2025.0238
Abstract
Introduction: Upper tract urothelial carcinoma (UTUC) is associated with poor survival outcomes. Therefore, providing reliable information about UTUC is crucial. Recently, chatbots powered by large language models have become a widely used information source. Our aim was to evaluate and compare responses generated by ChatGPT-4o and DeepSeek-R1 to patient-important questions regarding UTUC.
Material and methods: A set of 43 questions, divided into four categories (general information, symptoms and diagnosis, treatment, prognosis), was curated. Each question was entered into DeepSeek-R1 and ChatGPT-4o. Answers were rated by two urologists on a scale from 1 (completely incorrect) to 4 (fully correct), and the median score was calculated for each question. Median scores ≥3 were considered accurate. The repeatability of responses was evaluated using cosine similarity, and the number of words in each response was counted.
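The abstract does not specify how responses were vectorized before computing cosine similarity. The minimal sketch below assumes a TF-IDF bag-of-words representation; the response_similarity helper and the sample answers are hypothetical illustrations, not the study's pipeline.

    # Minimal sketch of a repeatability check via cosine similarity.
    # ASSUMPTION: TF-IDF vectors; the paper does not state its text representation.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def response_similarity(response_a: str, response_b: str) -> float:
        """Cosine similarity between two chatbot answers to the same question."""
        vectors = TfidfVectorizer().fit_transform([response_a, response_b])
        return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

    # Hypothetical repeated answers to one UTUC question:
    first_run = "UTUC is a cancer arising in the lining of the renal pelvis or ureter."
    second_run = "UTUC is a malignancy of the urothelium lining the renal pelvis and ureter."
    print(f"cosine similarity: {response_similarity(first_run, second_run):.2f}")

A similarity above the study's 0.5 threshold would count a question's repeated answers as moderately to highly consistent.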
Results: The median scores for DeepSeek-R1 and ChatGPT-4o were both 3.5. There was no statistically significant difference between the scores assigned to the two chatbots across all questions (p = 0.35), nor within any particular category. DeepSeek-R1 and ChatGPT-4o provided satisfactory answers to 93% and 91% of the evaluated questions, respectively. No potentially dangerous information was found. Both models consistently generated responses with moderate-to-high similarity (cosine similarity >0.5), except for one query. Finally, DeepSeek-R1 provided significantly longer answers than ChatGPT-4o (p < 0.001).
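The abstract reports p-values without naming the statistical test used. As a hedged illustration only, the sketch below applies a Wilcoxon signed-rank test, one plausible choice for paired per-question scores, to dummy median values.

    # ASSUMPTION: Wilcoxon signed-rank test on paired per-question medians;
    # the abstract does not name the test, and these values are dummy data.
    from scipy.stats import wilcoxon

    deepseek_medians = [3.5, 4.0, 3.0, 3.5, 4.0, 2.5, 3.5]
    chatgpt_medians  = [3.5, 3.5, 3.0, 4.0, 3.5, 3.0, 3.5]

    stat, p_value = wilcoxon(deepseek_medians, chatgpt_medians)
    print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")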
Conclusions: Both DeepSeek-R1 and ChatGPT-4o predominantly provide satisfactory responses to patient-important questions about UTUC. Artificial intelligence chatbots show potential as first-line information sources for patients, but they struggle with highly specialized inquiries and thus cannot replace expert medical advice.