Comparative Performance of Deep-Reasoning and Lightweight Large Language Models on Oral Implantology Multiple-Choice Questions.

Viet Anh Nguyen, Thi Quynh Trang Vuong, Van Hung Nguyen
{"title":"Comparative Performance of Deep-Reasoning and Lightweight Large Language Models on Oral Implantology Multiple-Choice Questions.","authors":"Viet Anh Nguyen, Thi Quynh Trang Vuong, Van Hung Nguyen","doi":"10.11607/ijp.9504","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) show promise in dental education, but their performance on specialized implantology knowledge is unclear. This study compared the accuracy and response times of six LLM configurations on an implantology MCQ dataset to guide optimal model selection.</p><p><strong>Materials and methods: </strong>We administered 675 single-best-answer MCQs from a standard oral implantology question bank to six LLM setups in May 2025, including two OpenAI models (o3 and GPT-4o), two Microsoft Copilot modes (Deep and Quick, both based on o3-mini), and two Google Gemini variants (Flash and Pro). An independent assessor delivered questions in batches of ten using a uniform prompt, recorded each model's answers, and measured elapsed time per batch. Accuracy was the percentage of correct answers; response time was averaged across batches. χ² tests compared accuracy, and ANOVA compared response times.</p><p><strong>Results: </strong>Accuracy varied significantly (p = 0.001). Gemini Pro (83.1%) and o3 Any blinded information will be available then. (82.4%) achieved the highest rates, outperforming GPT-4o (76.9%). Copilot Deep (77.8%) did not significantly exceed Copilot Quick (75.1%). Deep reasoning models (o3, Gemini Pro) averaged 4-5 s per batch, while lightweight variants (Copilot Quick, Gemini Flash, GPT-4o) responded in under 1 s (p < 0.001). All models uniformly failed questions requiring precise epidemiological or contraindication data.</p><p><strong>Conclusions: </strong>Deep-reasoning LLMs deliver superior implantology MCQ accuracy at the cost of modestly longer inference times. Lightweight models offer near-instant responses with slightly lower accuracy. Aligning model choice to task complexity can optimize speed, cost, and diagnostic precision in implantology education and practice.</p>","PeriodicalId":94232,"journal":{"name":"The International journal of prosthodontics","volume":"0 0","pages":"1-20"},"PeriodicalIF":1.8000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International journal of prosthodontics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.11607/ijp.9504","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Purpose: Large language models (LLMs) show promise in dental education, but their performance on specialized implantology knowledge is unclear. This study compared the accuracy and response times of six LLM configurations on an implantology MCQ dataset to guide optimal model selection.

Materials and methods: We administered 675 single-best-answer MCQs from a standard oral implantology question bank to six LLM setups in May 2025, including two OpenAI models (o3 and GPT-4o), two Microsoft Copilot modes (Deep and Quick, both based on o3-mini), and two Google Gemini variants (Flash and Pro). An independent assessor delivered questions in batches of ten using a uniform prompt, recorded each model's answers, and measured elapsed time per batch. Accuracy was the percentage of correct answers; response time was averaged across batches. χ² tests compared accuracy, and ANOVA compared response times.
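
The authors' analysis code is not included in the abstract; the following is a minimal illustrative sketch, in Python with scipy.stats, of how the described analysis could be reproduced. The correct-answer counts are back-calculated from the accuracies reported in the Results, and the per-batch response times are invented placeholders consistent with the reported ranges; neither represents the study's actual data.

```python
# Illustrative sketch only -- not the authors' published code.
# Reproduces the described analysis: per-model accuracy, a chi-square test
# on the correct/incorrect contingency table, and one-way ANOVA on
# per-batch response times.
from scipy.stats import chi2_contingency, f_oneway

N_QUESTIONS = 675  # single-best-answer MCQs administered to each model

# Correct-answer counts back-calculated from the reported accuracies
# (e.g. 83.1% of 675 is roughly 561); treat them as approximations.
correct = {
    "Gemini Pro": 561,     # 83.1%
    "o3": 556,             # 82.4%
    "Copilot Deep": 525,   # 77.8%
    "GPT-4o": 519,         # 76.9%
    "Copilot Quick": 507,  # 75.1%
}

# Accuracy = correct answers / total questions, expressed as a percentage
for model, k in correct.items():
    print(f"{model}: {100 * k / N_QUESTIONS:.1f}% correct")

# Chi-square test across models on correct vs. incorrect counts
table = [[k, N_QUESTIONS - k] for k in correct.values()]
chi2, p_acc, dof, _ = chi2_contingency(table)
print(f"Accuracy comparison: chi2 = {chi2:.2f}, df = {dof}, p = {p_acc:.3f}")

# One-way ANOVA on elapsed time per 10-question batch (seconds);
# the times below are invented placeholders within the reported ranges.
batch_times_s = {
    "o3": [4.6, 5.0, 4.3, 4.8],
    "Gemini Pro": [4.2, 4.9, 4.5, 4.4],
    "GPT-4o": [0.8, 0.7, 0.9, 0.8],
    "Gemini Flash": [0.6, 0.7, 0.5, 0.6],
    "Copilot Quick": [0.9, 0.8, 0.7, 0.9],
}
f_stat, p_time = f_oneway(*batch_times_s.values())
print(f"Response-time comparison: F = {f_stat:.2f}, p = {p_time:.3f}")
```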

Results: Accuracy varied significantly across models (p = 0.001). Gemini Pro (83.1%) and o3 (82.4%) achieved the highest rates, outperforming GPT-4o (76.9%). Copilot Deep (77.8%) did not significantly exceed Copilot Quick (75.1%). Deep-reasoning models (o3, Gemini Pro) averaged 4-5 s per batch, while lightweight variants (Copilot Quick, Gemini Flash, GPT-4o) responded in under 1 s (p < 0.001). All models uniformly failed questions requiring precise epidemiological or contraindication data.

Conclusions: Deep-reasoning LLMs deliver superior implantology MCQ accuracy at the cost of modestly longer inference times. Lightweight models offer near-instant responses with slightly lower accuracy. Aligning model choice to task complexity can optimize speed, cost, and diagnostic precision in implantology education and practice.
