Evaluation of DeepSeek-R1 and ChatGPT-4o as educational sources for upper tract urothelial carcinoma
Wojciech Krajewski, Jan Łaszkiewicz, Łukasz Biesiadecki, Wojciech Tomczak, Łukasz Nowak, Piotr Łaszkiewicz, Joanna Chorbińska, Francesco Del Giudice, Benjamin I Chung, Tomasz Szydełko
Central European Journal of Urology 2026;79(1):1-8. DOI: 10.5173/ceju.2025.0238
Abstract
Introduction: Upper tract urothelial carcinoma (UTUC) is associated with poor survival outcomes. Therefore, providing reliable information about UTUC is crucial. Recently, chatbots powered by large language models have become a widely used information source. Our aim was to evaluate and compare responses generated by ChatGPT-4o and DeepSeek-R1 to patient-important questions regarding UTUC.
Material and methods: A set of 43 questions, divided into four categories (general information, symptoms and diagnosis, treatment, prognosis), was curated. Each question was entered into DeepSeek-R1 and ChatGPT-4o. Answers were rated by two urologists on a scale from 1 (completely incorrect) to 4 (fully correct), and the median score was calculated for each question. Median scores ≥3 were considered accurate. The repeatability of responses was evaluated using cosine similarity, and the number of words in each response was counted.
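The abstract does not specify how responses were vectorized before computing cosine similarity. The minimal sketch below assumes a TF-IDF bag-of-words representation; the response_similarity helper and the sample answers are hypothetical illustrations, not the study's pipeline.

    # Minimal sketch of a repeatability check via cosine similarity.
    # ASSUMPTION: TF-IDF vectors; the paper does not state its text representation.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def response_similarity(response_a: str, response_b: str) -> float:
        """Cosine similarity between two chatbot answers to the same question."""
        vectors = TfidfVectorizer().fit_transform([response_a, response_b])
        return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

    # Hypothetical repeated answers to one UTUC question:
    first_run = "UTUC is a cancer arising in the lining of the renal pelvis or ureter."
    second_run = "UTUC is a malignancy of the urothelium lining the renal pelvis and ureter."
    print(f"cosine similarity: {response_similarity(first_run, second_run):.2f}")

A similarity above the study's 0.5 threshold would count a question's repeated answers as moderately to highly consistent.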
Results: The median scores for DeepSeek-R1 and ChatGPT-4o were both 3.5. There was no statistically significant difference between the scores assigned to the two chatbots across all questions (p = 0.35), nor within any particular category. DeepSeek-R1 and ChatGPT-4o provided satisfactory answers to 93% and 91% of the evaluated questions, respectively. No potentially dangerous information was found. Both models consistently generated responses with moderate-to-high similarity (cosine similarity >0.5), except for one query. Finally, DeepSeek-R1 provided significantly longer answers than ChatGPT-4o (p < 0.001).
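The abstract reports p-values without naming the statistical test used. As a hedged illustration only, the sketch below applies a Wilcoxon signed-rank test, one plausible choice for paired per-question scores, to dummy median values.

    # ASSUMPTION: Wilcoxon signed-rank test on paired per-question medians;
    # the abstract does not name the test, and these values are dummy data.
    from scipy.stats import wilcoxon

    deepseek_medians = [3.5, 4.0, 3.0, 3.5, 4.0, 2.5, 3.5]
    chatgpt_medians  = [3.5, 3.5, 3.0, 4.0, 3.5, 3.0, 3.5]

    stat, p_value = wilcoxon(deepseek_medians, chatgpt_medians)
    print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")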
Conclusions: Both DeepSeek-R1 and ChatGPT-4o predominantly provide satisfactory responses to patient-important questions about UTUC. Artificial intelligence chatbots show potential as first-line information sources for patients, but they struggle with highly specialized inquiries and thus cannot replace expert medical advice.