Comparison of ChatGPT-4, Copilot, Bard and Gemini Ultra on an Otolaryngology Question Bank

IF 1.5 | Zone 4 (Medicine) | Q2 OTORHINOLARYNGOLOGY

Clinical Otolaryngology. Pub Date: 2025-03-13. DOI: 10.1111/coa.14302

Rashi Ramchandani, Eddie Guo, Michael Mostowy, Jason Kreutz, Nick Sahlollbey, Michele M. Carr, Janet Chung, Lisa Caulley

Clinical Otolaryngology, volume 50, issue 4, pages 704-711. Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1111/coa.14302

Citations: 0

Abstract

Objective

To compare the performance of Google Bard, Microsoft Copilot, GPT-4 with vision (GPT-4) and Gemini Ultra on the OTO Chautauqua, a student-created, faculty-reviewed otolaryngology question bank.

Study Design

Comparative performance evaluation of different LLMs.

Setting

N/A.

Participants

N/A.

Methods

Large language models (LLMs) are being extensively tested in medical education. However, their accuracy and effectiveness remain understudied, particularly in otolaryngology. This study involved inputting 350 single-best-answer multiple-choice questions, including 18 image-based questions, into four LLMs. Questions were sourced from six independent question banks related to (a) rhinology, (b) head and neck oncology, (c) endocrinology, (d) general otolaryngology, (e) paediatrics, (f) otology, (g) facial plastics, reconstruction and (h) trauma. The LLMs were instructed to provide the reasoning behind each answer, and the length of each response was recorded.
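The two quantities described above, accuracy against an answer key and mean response length, can be illustrated with a minimal sketch. It is not the authors' pipeline: the records, the model names and the `summarise` helper are hypothetical, and response length is assumed to be counted in characters because the abstract does not state the unit.

```python
from collections import defaultdict

# Hypothetical toy records: (model, answer key, option chosen by the model, response text).
# None of these values come from the study.
RECORDS = [
    ("model_a", "B", "B", "Option B, because the lesion arises from the parotid gland."),
    ("model_a", "D", "A", "Option A is most likely."),
    ("model_b", "B", "B", "B."),
    ("model_b", "D", "D", "Option D, given the audiogram findings."),
]


def summarise(records):
    """Return {model: (accuracy in percent, mean response length)} for the given records."""
    correct = defaultdict(int)   # number of correct answers per model
    total = defaultdict(int)     # number of questions asked per model
    length = defaultdict(int)    # summed response length per model
    for model, key, chosen, text in records:
        total[model] += 1
        correct[model] += int(chosen == key)
        length[model] += len(text)  # assumption: length is measured in characters
    return {
        model: (100 * correct[model] / total[model], length[model] / total[model])
        for model in total
    }


if __name__ == "__main__":
    for model, (accuracy, mean_length) in summarise(RECORDS).items():
        print(f"{model}: accuracy {accuracy:.1f}%, mean length {mean_length:.1f}")
```

Counting words or tokens instead of characters would only require changing the `len(text)` line.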

Results

Aggregate and subgroup analyses showed that Gemini was the most accurate LLM (79.8%), followed by GPT-4 (71.1%), Copilot (68.0%) and Bard (65.1%).

The LLMs had significantly different average response lengths: Bard's responses were the longest (x̄ = 1685.24), and there was no difference between GPT-4 (x̄ = 827.34) and Copilot (x̄ = 904.12). Gemini's longer responses (x̄ = 1291.68) included explanatory images and links. Gemini and GPT-4 correctly answered image-based questions (n = 18), unlike Copilot and Bard, highlighting the adaptability and multimodal capabilities of the former two.

Conclusion

Gemini outperformed the other LLMs in terms of accuracy, followed by GPT-4, Copilot and Bard. Although GPT-4 had only the second-highest accuracy, it provided concise and relevant explanations. Despite the promising performance of LLMs, medical learners should cautiously assess their accuracy and decision-making reliability.


Source journal

Clinical Otolaryngology (Medicine: Otorhinolaryngology)

CiteScore

4.00

Self-citation rate

4.80%

Articles published

106

Review time

>12 weeks

Journal description: Clinical Otolaryngology is a bimonthly journal devoted to clinically-oriented research papers of the highest scientific standards dealing with:

• current otorhinolaryngological practice
• audiology, otology, balance, rhinology, larynx, voice and paediatric ORL
• head and neck oncology
• head and neck plastic and reconstructive surgery
• continuing medical education and ORL training

The emphasis is on high quality new work in the clinical field and on fresh, original research. Each issue begins with an editorial expressing the personal opinions of an individual with a particular knowledge of a chosen subject. The main body of each issue is then devoted to original papers carrying important results for those working in the field. In addition, topical review articles are published discussing a particular subject in depth, including not only the opinions of the author but also any controversies surrounding the subject.

Negative/null results

In order for research to advance, negative results, which often make a valuable contribution to the field, should be published. However, articles containing negative or null results are frequently not considered for publication or rejected by journals. We welcome papers of this kind, where appropriate and valid power calculations are included that give confidence that a negative result can be relied upon.