Comparative Performance of Deep-Reasoning and Lightweight Large Language Models on Oral Implantology Multiple-Choice Questions.

Viet Anh Nguyen, Thi Quynh Trang Vuong, Van Hung Nguyen
{"title":"Comparative Performance of Deep-Reasoning and Lightweight Large Language Models on Oral Implantology Multiple-Choice Questions.","authors":"Viet Anh Nguyen, Thi Quynh Trang Vuong, Van Hung Nguyen","doi":"10.11607/ijp.9504","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) show promise in dental education, but their performance on specialized implantology knowledge is unclear. This study compared the accuracy and response times of six LLM configurations on an implantology MCQ dataset to guide optimal model selection.</p><p><strong>Materials and methods: </strong>We administered 675 single-best-answer MCQs from a standard oral implantology question bank to six LLM setups in May 2025, including two OpenAI models (o3 and GPT-4o), two Microsoft Copilot modes (Deep and Quick, both based on o3-mini), and two Google Gemini variants (Flash and Pro). An independent assessor delivered questions in batches of ten using a uniform prompt, recorded each model's answers, and measured elapsed time per batch. Accuracy was the percentage of correct answers; response time was averaged across batches. χ² tests compared accuracy, and ANOVA compared response times.</p><p><strong>Results: </strong>Accuracy varied significantly (p = 0.001). Gemini Pro (83.1%) and o3 Any blinded information will be available then. (82.4%) achieved the highest rates, outperforming GPT-4o (76.9%). Copilot Deep (77.8%) did not significantly exceed Copilot Quick (75.1%). Deep reasoning models (o3, Gemini Pro) averaged 4-5 s per batch, while lightweight variants (Copilot Quick, Gemini Flash, GPT-4o) responded in under 1 s (p < 0.001). All models uniformly failed questions requiring precise epidemiological or contraindication data.</p><p><strong>Conclusions: </strong>Deep-reasoning LLMs deliver superior implantology MCQ accuracy at the cost of modestly longer inference times. Lightweight models offer near-instant responses with slightly lower accuracy. Aligning model choice to task complexity can optimize speed, cost, and diagnostic precision in implantology education and practice.</p>","PeriodicalId":94232,"journal":{"name":"The International journal of prosthodontics","volume":"0 0","pages":"1-20"},"PeriodicalIF":1.8000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International journal of prosthodontics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.11607/ijp.9504","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Purpose: Large language models (LLMs) show promise in dental education, but their performance on specialized implantology knowledge is unclear. This study compared the accuracy and response times of six LLM configurations on an implantology MCQ dataset to guide optimal model selection.

Materials and methods: We administered 675 single-best-answer MCQs from a standard oral implantology question bank to six LLM setups in May 2025, including two OpenAI models (o3 and GPT-4o), two Microsoft Copilot modes (Deep and Quick, both based on o3-mini), and two Google Gemini variants (Flash and Pro). An independent assessor delivered questions in batches of ten using a uniform prompt, recorded each model's answers, and measured elapsed time per batch. Accuracy was the percentage of correct answers; response time was averaged across batches. χ² tests compared accuracy, and ANOVA compared response times.
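
The authors' analysis code is not included in the abstract; the following is a minimal illustrative sketch, in Python with scipy.stats, of how the described analysis could be reproduced. The correct-answer counts are back-calculated from the accuracies reported in the Results, and the per-batch response times are invented placeholders consistent with the reported ranges; neither represents the study's actual data.

```python
# Illustrative sketch only -- not the authors' published code.
# Reproduces the described analysis: per-model accuracy, a chi-square test
# on the correct/incorrect contingency table, and one-way ANOVA on
# per-batch response times.
from scipy.stats import chi2_contingency, f_oneway

N_QUESTIONS = 675  # single-best-answer MCQs administered to each model

# Correct-answer counts back-calculated from the reported accuracies
# (e.g. 83.1% of 675 is roughly 561); treat them as approximations.
correct = {
    "Gemini Pro": 561,     # 83.1%
    "o3": 556,             # 82.4%
    "Copilot Deep": 525,   # 77.8%
    "GPT-4o": 519,         # 76.9%
    "Copilot Quick": 507,  # 75.1%
}

# Accuracy = correct answers / total questions, expressed as a percentage
for model, k in correct.items():
    print(f"{model}: {100 * k / N_QUESTIONS:.1f}% correct")

# Chi-square test across models on correct vs. incorrect counts
table = [[k, N_QUESTIONS - k] for k in correct.values()]
chi2, p_acc, dof, _ = chi2_contingency(table)
print(f"Accuracy comparison: chi2 = {chi2:.2f}, df = {dof}, p = {p_acc:.3f}")

# One-way ANOVA on elapsed time per 10-question batch (seconds);
# the times below are invented placeholders within the reported ranges.
batch_times_s = {
    "o3": [4.6, 5.0, 4.3, 4.8],
    "Gemini Pro": [4.2, 4.9, 4.5, 4.4],
    "GPT-4o": [0.8, 0.7, 0.9, 0.8],
    "Gemini Flash": [0.6, 0.7, 0.5, 0.6],
    "Copilot Quick": [0.9, 0.8, 0.7, 0.9],
}
f_stat, p_time = f_oneway(*batch_times_s.values())
print(f"Response-time comparison: F = {f_stat:.2f}, p = {p_time:.3f}")
```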

Results: Accuracy varied significantly across models (p = 0.001). Gemini Pro (83.1%) and o3 (82.4%) achieved the highest rates, outperforming GPT-4o (76.9%). Copilot Deep (77.8%) did not significantly exceed Copilot Quick (75.1%). Deep-reasoning models (o3, Gemini Pro) averaged 4-5 s per batch, while lightweight variants (Copilot Quick, Gemini Flash, GPT-4o) responded in under 1 s (p < 0.001). All models uniformly failed questions requiring precise epidemiological or contraindication data.

Conclusions: Deep-reasoning LLMs deliver superior implantology MCQ accuracy at the cost of modestly longer inference times. Lightweight models offer near-instant responses with slightly lower accuracy. Aligning model choice to task complexity can optimize speed, cost, and diagnostic precision in implantology education and practice.
