Comparative Analysis of Performance of Large Language Models in Urogynecology.

Pub Date: 2024-06-27 | DOI: 10.1097/SPV.0000000000001545
Ghanshyam S Yadav, Kshitij Pandit, Phillip T Connell, Hadi Erfani, Charles W Nager

Abstract

Importance: Despite growing popularity in medicine, data on large language models in urogynecology are lacking.

Objective: The aim of this study was to compare the performance of ChatGPT-3.5, GPT-4, and Bard on the American Urogynecologic Society self-assessment examination.

Study design: The examination features 185 questions with a passing score of 80. We tested 3 models (ChatGPT-3.5, GPT-4, and Bard) on every question. Dedicated accounts enabled controlled comparisons. Questions with prompts were input into each model's interface, and responses were evaluated for correctness, logical reasoning behind the answer choice, and sourcing. Data on subcategory, question type, correctness rate, question difficulty, and reference quality were noted. The Fisher exact or χ2 test was used for statistical analysis.

Results: Out of 185 questions, GPT-4 answered 61.6% of questions correctly, compared with 54.6% for GPT-3.5 and 42.7% for Bard. GPT-4 answered all questions, whereas GPT-3.5 and Bard declined to answer 4 and 25 questions, respectively. All models demonstrated logical reasoning in their correct responses. The performance of all large language models was inversely proportional to the difficulty level of the questions. Bard referenced sources 97.5% of the time, more often than GPT-4 (83.3%) and GPT-3.5 (39%). GPT-3.5 cited books and websites, whereas GPT-4 and Bard additionally cited journal articles and society guidelines. The median journal impact factor and citation count of referenced sources were 3.6 and 20 for GPT-4 versus 2.6 and 25 for Bard.
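As a rough illustration only (not the authors' actual analysis code), the reported correctness percentages can be converted back into approximate counts and compared across models with a hand-computed Pearson chi-square, the kind of test named in the study design. The counts below are reconstructed from the abstract's percentages and 185-question total; the study's exact counts and software may differ.

```python
# Sketch of a chi-square comparison of the three models' correctness
# rates. Counts are reconstructed from reported percentages, not taken
# from the paper's underlying data.
TOTAL = 185  # questions on the examination

correct = {
    "GPT-4":   round(0.616 * TOTAL),  # ~114 correct
    "GPT-3.5": round(0.546 * TOTAL),  # ~101 correct
    "Bard":    round(0.427 * TOTAL),  # ~79 correct
}

# 3x2 contingency table: [correct, incorrect] per model
table = [[c, TOTAL - c] for c in correct.values()]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
grand = sum(sum(row) for row in table)
col_sums = [sum(row[j] for row in table) for j in range(2)]
chi2 = 0.0
for row in table:
    row_sum = sum(row)
    for j in range(2):
        expected = row_sum * col_sums[j] / grand
        chi2 += (row[j] - expected) ** 2 / expected
dof = (len(table) - 1) * (2 - 1)
print(f"chi2 = {chi2:.2f} on {dof} df")
```

On these reconstructed counts the statistic falls well beyond the 0.05 critical value for 2 degrees of freedom, consistent with a real performance gap between models; the published analysis should be treated as authoritative.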

Conclusions: Although GPT-4 outperformed GPT-3.5 and Bard, none of the models achieved a passing score. Clinicians should use language models cautiously in patient care scenarios until more evidence emerges.
