Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models

Fabio Borgonovo MD, Takahiro Matsuo MD, Francesco Petri MD, Seyed Mohammad Amin Alavi MD, Laura Chelsea Mazudie Ndjonko, Andrea Gori MD, Elie F. Berbari MD, MBA
Mayo Clinic Proceedings. Digital health, vol. 3, issue 3, Article 100230. Published May 23, 2025. DOI: 10.1016/j.mcpdig.2025.100230. Available at: https://www.sciencedirect.com/science/article/pii/S2949761225000379

Abstract

Objective

To evaluate the ability of 15 large language models (LLMs) to solve clinical cases involving osteoarticular infections in accordance with published guidelines.

Materials and Methods

The study evaluated 15 LLMs across 5 categories of osteoarticular infections: periprosthetic joint infection, diabetic foot infection, native vertebral osteomyelitis, fracture-related infection, and septic arthritis. Models were selected systematically to include both general-purpose and medical-specific systems with robust English support. In total, 126 text-based questions, developed by the authors from published guidelines and validated by experts, assessed diagnostic, management, and treatment strategies. Each model answered every question individually, and responses were classified as correct or incorrect against the guidelines. All tests were conducted between April 17, 2025, and April 28, 2025. Results, presented as percentages of correct answers and aggregated scores, highlight performance trends. Mixed-effects logistic regression with a random effect for question was used to quantify how the LLMs compared in answering the study questions.
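The scoring scheme described above reduces to a per-model accuracy: each response is graded correct (1) or incorrect (0) against the guidelines, and the tally over 126 questions is reported as a percentage. The following is a minimal illustrative sketch of that computation; the grading vectors and model labels below are hypothetical stand-ins, not the study's actual data or code.

```python
def accuracy(responses):
    """Percentage of correct answers for one model (responses are 0/1 grades)."""
    return 100.0 * sum(responses) / len(responses)

# Hypothetical grading vectors: 1 = answer matched the guideline, 0 = it did not.
graded = {
    "Model A": [1] * 119 + [0] * 7,   # 119/126 correct
    "Model B": [1] * 117 + [0] * 9,   # 117/126 correct
}

for model, responses in graded.items():
    print(f"{model}: {sum(responses)}/{len(responses)} "
          f"({accuracy(responses):.1f}%)")
```

A score of 119/126 corresponds to an accuracy of about 94.4%, matching the top result reported below.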

Results

The performance of 15 LLMs was evaluated, with the percentage of correct answers reported. OpenEvidence and Microsoft Copilot achieved the highest score (119/126 [94.4%]), excelling in multiple categories. ChatGPT-4o and Gemini 2.5 Pro each scored 117 of 126 (92.8%). When used as the reference, OpenEvidence was noninferior to every comparator and superior to 5 LLMs. Performance varied across categories, highlighting the strengths and limitations of individual models.

Conclusion

OpenEvidence and Microsoft Copilot achieved the highest accuracy among the evaluated LLMs, highlighting their potential for precisely addressing complex clinical cases. This study emphasizes the need for specialized, validated artificial intelligence tools in medical practice. Although promising, current models face limitations in real-world applications and require further refinement to support clinical decision making reliably.