Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne?

Impact factor 4.4; Zone 2, Medicine; JCR Q1, Urology & Nephrology

BJU International Pub Date : 2025-07-31 DOI:10.1111/bju.16873

Omer Faruk Asker, Muhammed Selim Recai, Yunus Emre Genc, Kader Ada Dogan, Tarik Emre Sener, Bahadir Sahin


Citations: 0

Abstract

Objective: To evaluate the accuracy, calibration error, readability, and understandability of widely used chatbots with objective measurements, using 35 questions derived from urology in-service examinations. The integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations.

Materials and Methods: A total of 35 European Board of Urology questions were put to five LLMs (ChatGPT-4o, DeepSeek-R1, Gemini, Grok-2, and Claude 3.5) using a standardised, systematically designed prompt that was identical across all models. Accuracy was assessed with Cohen's kappa for all models. Readability was assessed with the Flesch Reading Ease, Gunning Fog, Coleman–Liau, Simple Measure of Gobbledygook, and Automated Readability Index, while understandability was determined from residents' ratings on a Likert scale.

Results: The models and the answer key were in substantial agreement, with a Fleiss' kappa of 0.701 and a Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT-4o, 0.764 for DeepSeek-R1, and 0.765 for Grok-2 (80% accuracy for each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). ChatGPT-4o had the lowest calibration error (19.2%), and DeepSeek-R1 scored highest for readability. In the understandability analysis, Claude 3.5 received the highest rating.

Conclusion: Chatbots showed differing strengths across tasks. DeepSeek-R1, despite its recent release, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.
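The abstract names Cohen's kappa and calibration error but does not say how either was computed. As a rough illustration only, the sketch below shows the textbook form of Cohen's kappa (agreement between a model's answers and an answer key, corrected for chance) and one common definition of expected calibration error (the bin-weighted gap between stated confidence and observed accuracy). The function names, the binning scheme, and the sample answers are all hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted mean of |accuracy - confidence| (an assumed definition)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is clamped into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical data: five multiple-choice answers and four confidence scores.
answer_key = ["A", "B", "C", "D", "A"]
model_answers = ["A", "B", "C", "D", "B"]
kappa = cohen_kappa(answer_key, model_answers)

ece = expected_calibration_error(
    confidences=[0.9, 0.9, 0.6, 0.6],
    correct=[True, False, True, True],
)
```

Kappa is lower than raw accuracy whenever some agreement would be expected by chance, which is consistent with the abstract reporting kappa values near 0.77 alongside 80% accuracy.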


Source journal

BJU International (Medicine: Urology & Nephrology)

CiteScore: 9.10
Self-citation rate: 4.40%
Publication output: 262
Review time: 1 month

About the journal: BJUI is one of the most highly respected medical journals in the world, with a truly international range of published papers and appeal. Every issue gives invaluable practical information in the form of original articles, reviews, comments, surgical education articles, and translational science articles in the field of urology. BJUI employs topical sections, and is in full colour, making it easier to browse or search for something specific.