Omer Faruk Asker, Muhammed Selim Recai, Yunus Emre Genc, Kader Ada Dogan, Tarik Emre Sener, Bahadir Sahin
{"title":"泌尿外科中的聊天机器人:准确性、校准和可理解性DeepSeek会接管王位吗?","authors":"Omer Faruk Asker, Muhammed Selim Recai, Yunus Emre Genc, Kader Ada Dogan, Tarik Emre Sener, Bahadir Sahin","doi":"10.1111/bju.16873","DOIUrl":null,"url":null,"abstract":"ObjectiveTo evaluate widely used chatbots’ accuracy, calibration error, readability, and understandability with objective measurements by 35 questions derived from urology in‐service examinations, as the integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations.Materials and MethodsA total of 35 European Board of Urology questions were asked to five LLMs with a standardised prompt that was systematically designed and used across all models: ChatGPT‐4o, DeepSeek‐R1, Gemini, Grok‐2, and Claude 3.5. Accuracy was calculated by Cohen's kappa for all models. Readability was assessed by Flesch Reading Ease, Gunning Fog, Coleman–Liau, Simple Measure of Gobbledygook, and Automated Readability Index, while understandability was determined by scores of residents’ ratings by a Likert scale.ResultsThe models and answer key were in substantial agreement with a Fleiss’ kappa of 0.701, and Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT‐4o, 0.764 for DeepSeek‐R, and 0.765 for Grok‐2 (80% accuracy for each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). The lowest calibration error was found in ChatGPT‐4o (19.2%) and DeepSeek‐R1 scored the highest for readability. In understandability analysis, Claude 3.5 had the highest rating compared to others.ConclusionChatbots demonstrated various powers across different tasks. DeepSeek‐R1, despite being just released, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.","PeriodicalId":8985,"journal":{"name":"BJU International","volume":"97 1","pages":""},"PeriodicalIF":4.4000,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne?\",\"authors\":\"Omer Faruk Asker, Muhammed Selim Recai, Yunus Emre Genc, Kader Ada Dogan, Tarik Emre Sener, Bahadir Sahin\",\"doi\":\"10.1111/bju.16873\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ObjectiveTo evaluate widely used chatbots’ accuracy, calibration error, readability, and understandability with objective measurements by 35 questions derived from urology in‐service examinations, as the integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations.Materials and MethodsA total of 35 European Board of Urology questions were asked to five LLMs with a standardised prompt that was systematically designed and used across all models: ChatGPT‐4o, DeepSeek‐R1, Gemini, Grok‐2, and Claude 3.5. Accuracy was calculated by Cohen's kappa for all models. Readability was assessed by Flesch Reading Ease, Gunning Fog, Coleman–Liau, Simple Measure of Gobbledygook, and Automated Readability Index, while understandability was determined by scores of residents’ ratings by a Likert scale.ResultsThe models and answer key were in substantial agreement with a Fleiss’ kappa of 0.701, and Cronbach's alpha of 0.914. 
For accuracy, Cohen's kappa was 0.767 for ChatGPT‐4o, 0.764 for DeepSeek‐R, and 0.765 for Grok‐2 (80% accuracy for each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). The lowest calibration error was found in ChatGPT‐4o (19.2%) and DeepSeek‐R1 scored the highest for readability. In understandability analysis, Claude 3.5 had the highest rating compared to others.ConclusionChatbots demonstrated various powers across different tasks. DeepSeek‐R1, despite being just released, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.\",\"PeriodicalId\":8985,\"journal\":{\"name\":\"BJU International\",\"volume\":\"97 1\",\"pages\":\"\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2025-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BJU International\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1111/bju.16873\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"UROLOGY & NEPHROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BJU International","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1111/bju.16873","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne?
Objective: To evaluate the accuracy, calibration error, readability, and understandability of widely used chatbots with objective measurements, using 35 questions derived from urology in-service examinations, as the integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations.

Materials and Methods: A total of 35 European Board of Urology questions were posed to five LLMs (ChatGPT-4o, DeepSeek-R1, Gemini, Grok-2, and Claude 3.5) using a standardised prompt that was systematically designed and applied across all models. Accuracy was calculated with Cohen's kappa for all models. Readability was assessed with the Flesch Reading Ease, Gunning Fog, Coleman–Liau, Simple Measure of Gobbledygook (SMOG), and Automated Readability Index scores, while understandability was determined by residents' ratings on a Likert scale.

Results: The models and the answer key were in substantial agreement, with a Fleiss' kappa of 0.701 and a Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT-4o, 0.764 for DeepSeek-R1, and 0.765 for Grok-2 (80% accuracy each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). ChatGPT-4o had the lowest calibration error (19.2%), and DeepSeek-R1 scored highest for readability. In the understandability analysis, Claude 3.5 received the highest rating.

Conclusion: The chatbots demonstrated varying strengths across different tasks. DeepSeek-R1, despite its recent release, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.
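The abstract names the agreement and readability metrics but does not describe a computational pipeline. The snippet below is a minimal sketch of how such scores could be produced in Python, assuming the scikit-learn and textstat packages; the answer lists and explanation text are hypothetical placeholders, not the study's data.

```python
# Sketch only: illustrates the kind of metrics reported in the abstract.
# The answer key, model answers, and explanation text below are invented
# placeholders; the authors' actual data and pipeline are not shown here.
from sklearn.metrics import cohen_kappa_score
import textstat

# Hypothetical answer key and one model's responses (truncated for illustration;
# the study used 35 multiple-choice questions).
answer_key = ["A", "C", "B", "D", "A"]
model_answers = ["A", "C", "D", "D", "A"]

# Agreement with the answer key (Cohen's kappa) and raw accuracy.
kappa = cohen_kappa_score(answer_key, model_answers)
accuracy = sum(k == m for k, m in zip(answer_key, model_answers)) / len(answer_key)
print(f"Cohen's kappa: {kappa:.3f}, raw accuracy: {accuracy:.1%}")

# Readability of a model's free-text explanation, using the indices named
# in the abstract (Flesch Reading Ease, Gunning Fog, Coleman-Liau, SMOG, ARI).
explanation = (
    "Tamsulosin is an alpha-1 adrenergic blocker. It relaxes smooth muscle in "
    "the prostate and bladder neck. This improves urinary flow in benign "
    "prostatic hyperplasia."
)
print("Flesch Reading Ease:", textstat.flesch_reading_ease(explanation))
print("Gunning Fog:", textstat.gunning_fog(explanation))
print("Coleman-Liau:", textstat.coleman_liau_index(explanation))
print("SMOG:", textstat.smog_index(explanation))
print("ARI:", textstat.automated_readability_index(explanation))
```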
Journal Introduction:
BJUI is one of the most highly respected medical journals in the world, with a truly international range of published papers and appeal. Every issue gives invaluable practical information in the form of original articles, reviews, comments, surgical education articles, and translational science articles in the field of urology. BJUI employs topical sections, and is in full colour, making it easier to browse or search for something specific.