Performance of large language models in fluoride-related dental knowledge: a comparative evaluation study of ChatGPT-4, Claude 3.5 Sonnet, Copilot, and Grok 3.
{"title":"Performance of large language models in fluoride-related dental knowledge: a comparative evaluation study of ChatGPT-4, Claude 3.5 Sonnet, Copilot, and Grok 3.","authors":"Raju Biswas, Atanu Mukhopadhyay, Santanu Mukhopadhyay","doi":"10.12701/jyms.2025.42.53","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have rapidly emerged as valuable tools in medical and dental education that support clinical reasoning, patient communication, and academic instruction. However, their effectiveness in conveying specialized content, such as fluoride-related dental knowledge, requires a thorough evaluation. This study assesses the performance of four advanced LLMs-ChatGPT-4 (OpenAI), Claude 3.5 Sonnet (Anthropic), Microsoft Copilot, and Grok 3 (xAI)-in addressing fluoride-related topics in dentistry.</p><p><strong>Methods: </strong>A cross-sectional comparative analysis was conducted using a mixed methods framework. Fifty multiple-choice questions (MCQs) and ten open-ended questions covering fluoride chemistry, clinical applications, and safety concerns were administered to each LLM. Open-ended responses were scored by two blinded expert raters using a four-dimensional rubric of accuracy, depth, clarity, and evidence. Interrater agreement was assessed using Cohen's kappa and Spearman's rank correlation tests. Statistical comparisons were performed using analysis of variance, Kruskal-Wallis, and post-hoc tests.</p><p><strong>Results: </strong>All models demonstrated high MCQ accuracy (88%-94%). Claude 3.5 Sonnet consistently achieved the highest average scores for the open-ended responses, particularly in the clarity dimension, with a statistically significant difference (p=0.009). Minor differences of 0.1 to 0.6 points between models in accuracy, depth, and evidence dimensions were observed but did not reach statistical significance. Despite minor differences across dimensions, all LLMs exhibited a strong performance in conveying fluoride-related dental content. Interrater agreement in model rankings was generally strong, supporting the reliability of the comparative outcomes.</p><p><strong>Conclusion: </strong>Advanced LLMs have substantial potential as supplementary tools for dental education and patient communication regarding fluoride use. Claude 3.5 Sonnet showed a notable advantage in terms of linguistic clarity, highlighting its value in educational contexts. Ongoing evaluation, clinical validation, and oversight are essential to ensure safe and effective integration of LLM into dentistry.</p>","PeriodicalId":74020,"journal":{"name":"Journal of Yeungnam medical science","volume":"42 ","pages":"53"},"PeriodicalIF":1.4000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Yeungnam medical science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12701/jyms.2025.42.53","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/9/1 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Large language models (LLMs) have rapidly emerged as valuable tools in medical and dental education, supporting clinical reasoning, patient communication, and academic instruction. However, their effectiveness in conveying specialized content, such as fluoride-related dental knowledge, requires thorough evaluation. This study assesses the performance of four advanced LLMs in addressing fluoride-related topics in dentistry: ChatGPT-4 (OpenAI), Claude 3.5 Sonnet (Anthropic), Microsoft Copilot, and Grok 3 (xAI).
Methods: A cross-sectional comparative analysis was conducted using a mixed-methods framework. Fifty multiple-choice questions (MCQs) and ten open-ended questions covering fluoride chemistry, clinical applications, and safety concerns were administered to each LLM. Open-ended responses were scored by two blinded expert raters using a four-dimensional rubric of accuracy, depth, clarity, and evidence. Interrater agreement was assessed using Cohen's kappa and Spearman's rank correlation. Statistical comparisons were performed using analysis of variance (ANOVA), Kruskal-Wallis tests, and post-hoc comparisons.
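As a minimal sketch of how such an analysis could be run in Python (the paper does not publish code; all scores below are synthetic stand-ins for the two raters' rubric scores, not the study's data):

```python
# Illustrative sketch, not the authors' code: interrater agreement and
# between-model comparisons as described in the Methods. All rater scores
# here are randomly generated placeholders.
import numpy as np
from scipy.stats import spearmanr, f_oneway, kruskal
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
models = ["ChatGPT-4", "Claude 3.5 Sonnet", "Copilot", "Grok 3"]

# Hypothetical rubric scores: two raters x 10 open-ended questions per model.
rater1 = {m: rng.integers(3, 6, size=10) for m in models}
rater2 = {m: rng.integers(3, 6, size=10) for m in models}

# Interrater agreement pooled across models: Cohen's kappa treats the
# scores as categories; Spearman's rho checks rank correlation.
all_r1 = np.concatenate([rater1[m] for m in models])
all_r2 = np.concatenate([rater2[m] for m in models])
print("Cohen's kappa:", cohen_kappa_score(all_r1, all_r2))
print("Spearman's rho:", spearmanr(all_r1, all_r2).statistic)

# Between-model comparison on the raters' mean scores:
# parametric (one-way ANOVA) and non-parametric (Kruskal-Wallis).
mean_scores = [(rater1[m] + rater2[m]) / 2 for m in models]
print("ANOVA p:", f_oneway(*mean_scores).pvalue)
print("Kruskal-Wallis p:", kruskal(*mean_scores).pvalue)
```

In practice, a significant omnibus result would be followed by the post-hoc pairwise tests the abstract mentions, with an appropriate multiple-comparison correction.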
Results: All models demonstrated high MCQ accuracy (88%-94%). Claude 3.5 Sonnet consistently achieved the highest average scores for the open-ended responses, particularly in the clarity dimension, where the difference was statistically significant (p=0.009). Minor between-model differences of 0.1 to 0.6 points in the accuracy, depth, and evidence dimensions were observed but did not reach statistical significance. Despite these differences, all LLMs exhibited strong performance in conveying fluoride-related dental content. Interrater agreement in model rankings was generally strong, supporting the reliability of the comparative outcomes.
Conclusion: Advanced LLMs have substantial potential as supplementary tools for dental education and patient communication regarding fluoride use. Claude 3.5 Sonnet showed a notable advantage in linguistic clarity, highlighting its value in educational contexts. Ongoing evaluation, clinical validation, and oversight are essential to ensure the safe and effective integration of LLMs into dentistry.