From ChatGPT to UroGPT: A guideline-trained artificial intelligence model for male infertility
Elie Kaplan-Marans, Yitzchak E Katlowitz, Michael West, Navid Leelani, Christopher Edwards, David Silver, Jacob Khurgin
Current Urology 2026;20(3):135-140. doi: 10.1097/CU9.0000000000000328
Abstract
Background: ChatGPT is not yet sufficiently reliable for answering clinical questions relevant to direct patient care. We hypothesized that a GPT model trained exclusively on expert guidelines would provide more accurate, guideline-concordant responses.
Materials and methods: With permission from the European Association of Urology (EAU), we developed UroGPT, a custom GPT model trained solely on the EAU guidelines. We posed 25 clinical questions derived from the EAU Male Infertility Guidelines and expert opinion to both standard ChatGPT (GPT-4o) and UroGPT. Responses were anonymized and graded by 2 blinded reviewers as "complete and accurate," "incomplete but accurate," or "incorrect or misleading." Guideline concordance was compared using the chi-square test.
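The study built UroGPT with ChatGPT's custom GPT builder rather than the API, so the following is only a minimal sketch of the same idea under that assumption: restricting a GPT-4o query to supplied guideline text via the OpenAI Python client. GUIDELINE_TEXT, the system prompt, and the sample question are placeholders, not the study's actual configuration.

```python
# Hedged sketch of a guideline-restricted query (NOT the authors' exact
# setup, which used ChatGPT's custom GPT builder). Requires the openai
# package; reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

GUIDELINE_TEXT = "..."  # placeholder: EAU guideline excerpts (used with permission)

def ask_urogpt(question: str) -> str:
    """Ask GPT-4o a question, constrained to the supplied guideline text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the EAU guideline text provided below. "
                    "If the guidelines do not address the question, say so.\n\n"
                    + GUIDELINE_TEXT
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Hypothetical example question, in the spirit of the study's 25 items:
print(ask_urogpt("When is varicocele repair indicated for male infertility?"))
```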
Results: UroGPT demonstrated significantly greater concordance with guideline-based responses than ChatGPT (p < 0.001). UroGPT provided complete and accurate responses to 94% (47/50) of questions, whereas ChatGPT did so for only 38% (19/50). ChatGPT also produced a significantly higher rate of incorrect or misleading responses (52% vs. 4%). Inter-reviewer agreement was higher for UroGPT (88% vs. 48%), suggesting that its answers were clearer and more consistent with the guidelines. ChatGPT frequently overgeneralized, recommended unsupported interventions, or offered non-guideline-based lifestyle advice. However, both models failed to correctly answer 2 high-stakes questions regarding orchiectomy in patients with undescended testes.
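The chi-square comparison can be reproduced from the reported figures: 47/50 complete-and-accurate and 2/50 (4%) incorrect responses for UroGPT, 19/50 complete-and-accurate and 26/50 (52%) incorrect for ChatGPT; the "incomplete but accurate" counts (1 and 5) are inferred as remainders rather than stated explicitly. A minimal sketch using scipy:

```python
# Chi-square test on the grading distribution, assuming scipy.
# Columns: complete/accurate, incomplete/accurate, incorrect/misleading.
# Middle-column counts (1 and 5) are inferred from the abstract's
# percentages, not reported directly.
from scipy.stats import chi2_contingency

table = [
    [47, 1, 2],   # UroGPT  (94%, 2%, 4% of 50 graded responses)
    [19, 5, 26],  # ChatGPT (38%, 10%, 52% of 50 graded responses)
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

With these counts the statistic is roughly 35 on 2 degrees of freedom, consistent with the reported p < 0.001.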
Conclusions: UroGPT markedly outperformed ChatGPT in guideline concordance. Training artificial intelligence models on expert-authored content represents a meaningful step toward developing clinically useful large language models. However, UroGPT is not yet appropriate for direct patient care and should currently be used only for research and academic purposes.