Viet Anh Nguyen, Thi Quynh Trang Vuong, Van Hung Nguyen
{"title":"深度推理和轻量级大语言模型在口腔种植学选择题中的比较表现。","authors":"Viet Anh Nguyen, Thi Quynh Trang Vuong, Van Hung Nguyen","doi":"10.11607/ijp.9504","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) show promise in dental education, but their performance on specialized implantology knowledge is unclear. This study compared the accuracy and response times of six LLM configurations on an implantology MCQ dataset to guide optimal model selection.</p><p><strong>Materials and methods: </strong>We administered 675 single-best-answer MCQs from a standard oral implantology question bank to six LLM setups in May 2025, including two OpenAI models (o3 and GPT-4o), two Microsoft Copilot modes (Deep and Quick, both based on o3-mini), and two Google Gemini variants (Flash and Pro). An independent assessor delivered questions in batches of ten using a uniform prompt, recorded each model's answers, and measured elapsed time per batch. Accuracy was the percentage of correct answers; response time was averaged across batches. χ² tests compared accuracy, and ANOVA compared response times.</p><p><strong>Results: </strong>Accuracy varied significantly (p = 0.001). Gemini Pro (83.1%) and o3 Any blinded information will be available then. (82.4%) achieved the highest rates, outperforming GPT-4o (76.9%). Copilot Deep (77.8%) did not significantly exceed Copilot Quick (75.1%). Deep reasoning models (o3, Gemini Pro) averaged 4-5 s per batch, while lightweight variants (Copilot Quick, Gemini Flash, GPT-4o) responded in under 1 s (p < 0.001). All models uniformly failed questions requiring precise epidemiological or contraindication data.</p><p><strong>Conclusions: </strong>Deep-reasoning LLMs deliver superior implantology MCQ accuracy at the cost of modestly longer inference times. Lightweight models offer near-instant responses with slightly lower accuracy. Aligning model choice to task complexity can optimize speed, cost, and diagnostic precision in implantology education and practice.</p>","PeriodicalId":94232,"journal":{"name":"The International journal of prosthodontics","volume":"0 0","pages":"1-20"},"PeriodicalIF":1.8000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparative Performance of Deep-Reasoning and Lightweight Large Language Models on Oral Implantology Multiple-Choice Questions.\",\"authors\":\"Viet Anh Nguyen, Thi Quynh Trang Vuong, Van Hung Nguyen\",\"doi\":\"10.11607/ijp.9504\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Large language models (LLMs) show promise in dental education, but their performance on specialized implantology knowledge is unclear. This study compared the accuracy and response times of six LLM configurations on an implantology MCQ dataset to guide optimal model selection.</p><p><strong>Materials and methods: </strong>We administered 675 single-best-answer MCQs from a standard oral implantology question bank to six LLM setups in May 2025, including two OpenAI models (o3 and GPT-4o), two Microsoft Copilot modes (Deep and Quick, both based on o3-mini), and two Google Gemini variants (Flash and Pro). An independent assessor delivered questions in batches of ten using a uniform prompt, recorded each model's answers, and measured elapsed time per batch. Accuracy was the percentage of correct answers; response time was averaged across batches. 
χ² tests compared accuracy, and ANOVA compared response times.</p><p><strong>Results: </strong>Accuracy varied significantly (p = 0.001). Gemini Pro (83.1%) and o3 Any blinded information will be available then. (82.4%) achieved the highest rates, outperforming GPT-4o (76.9%). Copilot Deep (77.8%) did not significantly exceed Copilot Quick (75.1%). Deep reasoning models (o3, Gemini Pro) averaged 4-5 s per batch, while lightweight variants (Copilot Quick, Gemini Flash, GPT-4o) responded in under 1 s (p < 0.001). All models uniformly failed questions requiring precise epidemiological or contraindication data.</p><p><strong>Conclusions: </strong>Deep-reasoning LLMs deliver superior implantology MCQ accuracy at the cost of modestly longer inference times. Lightweight models offer near-instant responses with slightly lower accuracy. Aligning model choice to task complexity can optimize speed, cost, and diagnostic precision in implantology education and practice.</p>\",\"PeriodicalId\":94232,\"journal\":{\"name\":\"The International journal of prosthodontics\",\"volume\":\"0 0\",\"pages\":\"1-20\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2025-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The International journal of prosthodontics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.11607/ijp.9504\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International journal of prosthodontics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.11607/ijp.9504","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Comparative Performance of Deep-Reasoning and Lightweight Large Language Models on Oral Implantology Multiple-Choice Questions.
Purpose: Large language models (LLMs) show promise in dental education, but their performance on specialized implantology knowledge is unclear. This study compared the accuracy and response times of six LLM configurations on an implantology MCQ dataset to guide optimal model selection.
Materials and methods: We administered 675 single-best-answer MCQs from a standard oral implantology question bank to six LLM setups in May 2025, including two OpenAI models (o3 and GPT-4o), two Microsoft Copilot modes (Deep and Quick, both based on o3-mini), and two Google Gemini variants (Flash and Pro). An independent assessor delivered questions in batches of ten using a uniform prompt, recorded each model's answers, and measured elapsed time per batch. Accuracy was the percentage of correct answers; response time was averaged across batches. χ² tests compared accuracy, and ANOVA compared response times.
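As an illustration of the scoring and statistical workflow described above, the sketch below shows how per-model accuracy, the χ² test on accuracy, and the ANOVA on batch response times could be computed. This is a minimal sketch under stated assumptions, not the study's actual analysis script: the correct-answer counts are back-calculated from the reported percentages, the per-batch times are placeholder values, and scipy is assumed as the statistics library.

```python
# Minimal sketch (not the study's code) of the scoring and statistics workflow:
# accuracy as percentage correct, chi-square across models, ANOVA on batch times.
from scipy.stats import chi2_contingency, f_oneway

N_QUESTIONS = 675  # single-best-answer MCQs, delivered in batches of ten

# Correct-answer counts back-calculated from the reported accuracies;
# Gemini Flash is omitted because its accuracy is not given in the abstract.
correct = {
    "Gemini Pro": round(0.831 * N_QUESTIONS),
    "o3": round(0.824 * N_QUESTIONS),
    "Copilot Deep": round(0.778 * N_QUESTIONS),
    "GPT-4o": round(0.769 * N_QUESTIONS),
    "Copilot Quick": round(0.751 * N_QUESTIONS),
}

# Accuracy = percentage of correct answers per model.
accuracy = {model: 100 * c / N_QUESTIONS for model, c in correct.items()}

# Chi-square test on the correct/incorrect contingency table across models.
table = [[c, N_QUESTIONS - c] for c in correct.values()]
chi2, p_accuracy, dof, _ = chi2_contingency(table)

# One-way ANOVA on elapsed time per batch of ten questions (seconds).
# The values below are placeholders, not measured data.
batch_times = {
    "Gemini Pro": [4.8, 5.1, 4.6],    # deep-reasoning: roughly 4-5 s per batch
    "o3": [4.3, 4.9, 4.5],
    "Gemini Flash": [0.7, 0.8, 0.6],  # lightweight: under 1 s per batch
    "GPT-4o": [0.9, 0.8, 0.7],
}
f_stat, p_time = f_oneway(*batch_times.values())

print(accuracy)
print(f"accuracy chi-square p = {p_accuracy:.3f}, response-time ANOVA p = {p_time:.4f}")
```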
Results: Accuracy varied significantly (p = 0.001). Gemini Pro (83.1%) and o3 (82.4%) achieved the highest rates, outperforming GPT-4o (76.9%). Copilot Deep (77.8%) did not significantly exceed Copilot Quick (75.1%). Deep-reasoning models (o3, Gemini Pro) averaged 4-5 s per batch, while lightweight variants (Copilot Quick, Gemini Flash, GPT-4o) responded in under 1 s (p < 0.001). All models consistently failed questions requiring precise epidemiological or contraindication data.
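To make a pairwise comparison concrete (for example, Copilot Deep vs Copilot Quick), a 2x2 χ² test on correct/incorrect counts back-calculated from the reported percentages could be run as below. This is an illustrative sketch of the kind of comparison reported, not the authors' analysis; the rounding of counts to whole questions is an assumption.

```python
# Sketch of a pairwise 2x2 chi-square comparison (Copilot Deep vs Copilot Quick).
# Counts are back-calculated from the reported percentages over 675 questions.
from scipy.stats import chi2_contingency

n = 675
deep_correct = round(0.778 * n)   # Copilot Deep: 77.8% correct
quick_correct = round(0.751 * n)  # Copilot Quick: 75.1% correct

table = [
    [deep_correct, n - deep_correct],
    [quick_correct, n - quick_correct],
]
chi2, p, dof, _ = chi2_contingency(table)

# A p-value above 0.05 here is consistent with the reported lack of a
# significant difference between the two Copilot modes.
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```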
Conclusions: Deep-reasoning LLMs deliver superior implantology MCQ accuracy at the cost of modestly longer inference times. Lightweight models offer near-instant responses with slightly lower accuracy. Aligning model choice to task complexity can optimize speed, cost, and diagnostic precision in implantology education and practice.