From ChatGPT to UroGPT: A guideline-trained artificial intelligence model for male infertility
Elie Kaplan-Marans, Yitzchak E Katlowitz, Michael West, Navid Leelani, Christopher Edwards, David Silver, Jacob Khurgin
Current Urology 2026;20(3):135-140. doi: 10.1097/CU9.0000000000000328
Abstract
Background: ChatGPT is not yet sufficiently reliable for answering clinical questions relevant to direct patient care. We hypothesized that a GPT model trained exclusively on expert guidelines would provide more accurate, guideline-concordant responses.
Materials and methods: With permission from the European Association of Urology (EAU), we developed UroGPT, a custom GPT model trained solely on the EAU guidelines. We posed 25 clinical questions derived from the EAU Male Infertility Guidelines and expert opinion to both standard ChatGPT (GPT-4o) and UroGPT. Responses were anonymized and graded by 2 blinded reviewers as "complete and accurate," "incomplete but accurate," or "incorrect or misleading." Guideline concordance was compared using the chi-square test.
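The study built UroGPT with ChatGPT's custom GPT builder rather than the API, so the following is only a minimal sketch of the same idea under that assumption: restricting a GPT-4o query to supplied guideline text via the OpenAI Python client. GUIDELINE_TEXT, the system prompt, and the sample question are placeholders, not the study's actual configuration.

```python
# Hedged sketch of a guideline-restricted query (NOT the authors' exact
# setup, which used ChatGPT's custom GPT builder). Requires the openai
# package; reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

GUIDELINE_TEXT = "..."  # placeholder: EAU guideline excerpts (used with permission)

def ask_urogpt(question: str) -> str:
    """Ask GPT-4o a question, constrained to the supplied guideline text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the EAU guideline text provided below. "
                    "If the guidelines do not address the question, say so.\n\n"
                    + GUIDELINE_TEXT
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Hypothetical example question, in the spirit of the study's 25 items:
print(ask_urogpt("When is varicocele repair indicated for male infertility?"))
```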
Results: UroGPT demonstrated significantly greater concordance with guideline-based responses than ChatGPT (p < 0.001). UroGPT provided complete and accurate responses to 94% (47/50) of questions, whereas ChatGPT did so for only 38% (19/50). ChatGPT also produced a significantly higher rate of incorrect or misleading responses (52% vs. 4%). Inter-reviewer agreement was higher for UroGPT (88% vs. 48%), suggesting that its answers were clearer and more consistent with the guidelines. ChatGPT frequently overgeneralized, recommended unsupported interventions, or offered non-guideline-based lifestyle advice. However, both models failed to correctly answer 2 high-stakes questions regarding orchiectomy in patients with undescended testes.
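The chi-square comparison can be reproduced from the reported figures: 47/50 complete-and-accurate and 2/50 (4%) incorrect responses for UroGPT, 19/50 complete-and-accurate and 26/50 (52%) incorrect for ChatGPT; the "incomplete but accurate" counts (1 and 5) are inferred as remainders rather than stated explicitly. A minimal sketch using scipy:

```python
# Chi-square test on the grading distribution, assuming scipy.
# Columns: complete/accurate, incomplete/accurate, incorrect/misleading.
# Middle-column counts (1 and 5) are inferred from the abstract's
# percentages, not reported directly.
from scipy.stats import chi2_contingency

table = [
    [47, 1, 2],   # UroGPT  (94%, 2%, 4% of 50 graded responses)
    [19, 5, 26],  # ChatGPT (38%, 10%, 52% of 50 graded responses)
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

With these counts the statistic is roughly 35 on 2 degrees of freedom, consistent with the reported p < 0.001.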
Conclusions: UroGPT markedly outperformed ChatGPT in guideline concordance. Training artificial intelligence models on expert-authored content represents a meaningful step toward developing clinically useful large language models. However, UroGPT is not yet appropriate for direct patient care and should currently be used only for research and academic purposes.