Harnessing advanced large language models in otolaryngology board examinations: an investigation using python and application programming interfaces.

IF 2.2 · CAS Tier 3 (Medicine) · JCR Q2 · OTORHINOLARYNGOLOGY
Cosima C Hoch, Paul F Funk, Orlando Guntinas-Lichius, Gerd Fabian Volk, Jan-Christoffer Lüers, Timon Hussain, Markus Wirth, Benedikt Schmidl, Barbara Wollenberg, Michael Alfertshofer
{"title":"Harnessing advanced large language models in otolaryngology board examinations: an investigation using python and application programming interfaces.","authors":"Cosima C Hoch, Paul F Funk, Orlando Guntinas-Lichius, Gerd Fabian Volk, Jan-Christoffer Lüers, Timon Hussain, Markus Wirth, Benedikt Schmidl, Barbara Wollenberg, Michael Alfertshofer","doi":"10.1007/s00405-025-09404-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to explore the capabilities of advanced large language models (LLMs), including OpenAI's GPT-4 variants, Google's Gemini series, and Anthropic's Claude series, in addressing highly specialized otolaryngology board examination questions. Additionally, the study included a longitudinal assessment of GPT-3.5 Turbo, which was evaluated using the same set of questions one year ago to identify changes in its performance over time.</p><p><strong>Methods: </strong>We utilized a question bank comprising 2,576 multiple-choice and single-choice questions from a German online education platform tailored for otolaryngology board certification preparation. The questions were submitted to 11 different LLMs, including GPT-3.5 Turbo, GPT-4 variants, Gemini models, and Claude models, through Application Programming Interfaces (APIs) using Python scripts, facilitating efficient data collection and processing.</p><p><strong>Results: </strong>GPT-4o demonstrated the highest accuracy among all models, particularly excelling in categories such as allergology and head and neck tumors. While the Claude models showed competitive performance, they generally lagged behind the GPT-4 variants. A comparison of GPT-3.5 Turbo's performance revealed a significant decline in accuracy over the past year. Newer LLMs displayed varied performance levels, with single-choice questions consistently yielding higher accuracy than multiple-choice questions across all models.</p><p><strong>Conclusion: </strong>While newer LLMs show strong potential in addressing specialized medical content, the observed decline in GPT-3.5 Turbo's performance over time underscores the necessity for continuous evaluation. This study highlights the critical need for ongoing optimization and efficient API usage to improve LLMs potential for applications in medical education and certification.</p>","PeriodicalId":11952,"journal":{"name":"European Archives of Oto-Rhino-Laryngology","volume":" ","pages":"3317-3328"},"PeriodicalIF":2.2000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12122622/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Archives of Oto-Rhino-Laryngology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00405-025-09404-x","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/25 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"OTORHINOLARYNGOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose: This study aimed to explore the capabilities of advanced large language models (LLMs), including OpenAI's GPT-4 variants, Google's Gemini series, and Anthropic's Claude series, in addressing highly specialized otolaryngology board examination questions. Additionally, the study included a longitudinal assessment of GPT-3.5 Turbo, which was evaluated using the same set of questions one year ago to identify changes in its performance over time.

Methods: We utilized a question bank comprising 2,576 multiple-choice and single-choice questions from a German online education platform tailored for otolaryngology board certification preparation. The questions were submitted to 11 different LLMs, including GPT-3.5 Turbo, GPT-4 variants, Gemini models, and Claude models, through Application Programming Interfaces (APIs) using Python scripts, facilitating efficient data collection and processing.
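
The study's actual scripts are not included here; the following is a minimal Python sketch of how such API-based question submission can be set up. The model identifiers, prompt wording, and helper names are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only - not the study's scripts. Model names, prompt
# wording, and environment-variable handling are assumptions.
import os

from openai import OpenAI            # OpenAI GPT models
import anthropic                     # Anthropic Claude models
import google.generativeai as genai  # Google Gemini models

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
claude_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

PROMPT = ("Answer the following otolaryngology board examination question. "
          "Reply only with the letter(s) of the correct option(s).\n\n{question}")

def ask_gpt(question: str, model: str = "gpt-4o") -> str:
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
        temperature=0,
    )
    return resp.choices[0].message.content

def ask_claude(question: str, model: str = "claude-3-opus-20240229") -> str:
    msg = claude_client.messages.create(
        model=model,
        max_tokens=64,
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
    )
    return msg.content[0].text

def ask_gemini(question: str, model: str = "gemini-1.5-pro") -> str:
    resp = genai.GenerativeModel(model).generate_content(PROMPT.format(question=question))
    return resp.text

if __name__ == "__main__":
    # Hypothetical example question; the study iterated over a 2,576-question bank.
    question = "Which structure passes through the stylomastoid foramen? A) ... B) ... C) ..."
    for name, ask in [("GPT-4o", ask_gpt), ("Claude", ask_claude), ("Gemini", ask_gemini)]:
        print(name, "->", ask(question))
```

In practice, a loop over the full question bank with rate limiting and response logging would be wrapped around calls like these.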

Results: GPT-4o demonstrated the highest accuracy among all models, particularly excelling in categories such as allergology and head and neck tumors. While the Claude models showed competitive performance, they generally lagged behind the GPT-4 variants. A comparison of GPT-3.5 Turbo's performance revealed a significant decline in accuracy over the past year. Newer LLMs displayed varied performance levels, with single-choice questions consistently yielding higher accuracy than multiple-choice questions across all models.
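
As a small illustration of how per-model, per-category, and per-question-type accuracy could be tallied from the collected responses, consider the sketch below. The record fields and sample data are hypothetical and not drawn from the study.

```python
# Hypothetical tallying sketch - field names and sample records are assumptions.
from collections import defaultdict

records = [
    {"model": "gpt-4o", "category": "allergology", "question_type": "single", "correct": True},
    {"model": "gpt-4o", "category": "head and neck tumors", "question_type": "multiple", "correct": True},
    {"model": "gpt-3.5-turbo", "category": "allergology", "question_type": "single", "correct": False},
]

def accuracy_by(records, key):
    """Return accuracy grouped by a record field such as 'model' or 'question_type'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {k: round(hits[k] / totals[k], 3) for k in totals}

print(accuracy_by(records, "model"))          # overall accuracy per model
print(accuracy_by(records, "question_type"))  # single- vs multiple-choice accuracy
print(accuracy_by(records, "category"))       # accuracy per subject category
```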

Conclusion: While newer LLMs show strong potential in addressing specialized medical content, the observed decline in GPT-3.5 Turbo's performance over time underscores the necessity for continuous evaluation. This study highlights the critical need for ongoing optimization and efficient API usage to improve LLMs' potential for applications in medical education and certification.

Source journal metrics:
CiteScore: 5.30
Self-citation rate: 7.70%
Articles published: 537
Review turnaround: 2-4 weeks
Journal description:
Official Journal of the European Union of Medical Specialists – ORL Section and Board
Official Journal of the Confederation of European Oto-Rhino-Laryngology Head and Neck Surgery
"European Archives of Oto-Rhino-Laryngology" publishes original clinical reports and clinically relevant experimental studies, as well as short communications presenting new results of special interest. With peer review by a respected international editorial board and prompt English-language publication, the journal provides rapid dissemination of information by authors from around the world. This particular feature makes it the journal of choice for readers who want to be informed about the continuing state of the art concerning basic sciences and the diagnosis and management of diseases of the head and neck on an international level. European Archives of Oto-Rhino-Laryngology was founded in 1864 as "Archiv für Ohrenheilkunde" by A. von Tröltsch, A. Politzer and H. Schwartze.