Performance of large language models in fluoride-related dental knowledge: a comparative evaluation study of ChatGPT-4, Claude 3.5 Sonnet, Copilot, and Grok 3.
{"title":"Performance of large language models in fluoride-related dental knowledge: a comparative evaluation study of ChatGPT-4, Claude 3.5 Sonnet, Copilot, and Grok 3.","authors":"Raju Biswas, Atanu Mukhopadhyay, Santanu Mukhopadhyay","doi":"10.12701/jyms.2025.42.53","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have rapidly emerged as valuable tools in medical and dental education that support clinical reasoning, patient communication, and academic instruction. However, their effectiveness in conveying specialized content, such as fluoride-related dental knowledge, requires a thorough evaluation. This study assesses the performance of four advanced LLMs-ChatGPT-4 (OpenAI), Claude 3.5 Sonnet (Anthropic), Microsoft Copilot, and Grok 3 (xAI)-in addressing fluoride-related topics in dentistry.</p><p><strong>Methods: </strong>A cross-sectional comparative analysis was conducted using a mixed methods framework. Fifty multiple-choice questions (MCQs) and ten open-ended questions covering fluoride chemistry, clinical applications, and safety concerns were administered to each LLM. Open-ended responses were scored by two blinded expert raters using a four-dimensional rubric of accuracy, depth, clarity, and evidence. Interrater agreement was assessed using Cohen's kappa and Spearman's rank correlation tests. Statistical comparisons were performed using analysis of variance, Kruskal-Wallis, and post-hoc tests.</p><p><strong>Results: </strong>All models demonstrated high MCQ accuracy (88%-94%). Claude 3.5 Sonnet consistently achieved the highest average scores for the open-ended responses, particularly in the clarity dimension, with a statistically significant difference (p=0.009). Minor differences of 0.1 to 0.6 points between models in accuracy, depth, and evidence dimensions were observed but did not reach statistical significance. Despite minor differences across dimensions, all LLMs exhibited a strong performance in conveying fluoride-related dental content. Interrater agreement in model rankings was generally strong, supporting the reliability of the comparative outcomes.</p><p><strong>Conclusion: </strong>Advanced LLMs have substantial potential as supplementary tools for dental education and patient communication regarding fluoride use. Claude 3.5 Sonnet showed a notable advantage in terms of linguistic clarity, highlighting its value in educational contexts. Ongoing evaluation, clinical validation, and oversight are essential to ensure safe and effective integration of LLM into dentistry.</p>","PeriodicalId":74020,"journal":{"name":"Journal of Yeungnam medical science","volume":"42 ","pages":"53"},"PeriodicalIF":1.4000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Yeungnam medical science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12701/jyms.2025.42.53","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/9/1 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Large language models (LLMs) have rapidly emerged as valuable tools in medical and dental education, supporting clinical reasoning, patient communication, and academic instruction. However, their effectiveness in conveying specialized content, such as fluoride-related dental knowledge, requires thorough evaluation. This study assesses the performance of four advanced LLMs in addressing fluoride-related topics in dentistry: ChatGPT-4 (OpenAI), Claude 3.5 Sonnet (Anthropic), Microsoft Copilot, and Grok 3 (xAI).
Methods: A cross-sectional comparative analysis was conducted using a mixed-methods framework. Fifty multiple-choice questions (MCQs) and ten open-ended questions covering fluoride chemistry, clinical applications, and safety concerns were administered to each LLM. Open-ended responses were scored by two blinded expert raters using a four-dimensional rubric of accuracy, depth, clarity, and evidence. Interrater agreement was assessed using Cohen's kappa and Spearman's rank correlation. Statistical comparisons were performed using analysis of variance (ANOVA), Kruskal-Wallis tests, and post-hoc comparisons.
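As a minimal sketch of how such an analysis could be run in Python (the paper does not publish code; all scores below are synthetic stand-ins for the two raters' rubric scores, not the study's data):

```python
# Illustrative sketch, not the authors' code: interrater agreement and
# between-model comparisons as described in the Methods. All rater scores
# here are randomly generated placeholders.
import numpy as np
from scipy.stats import spearmanr, f_oneway, kruskal
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
models = ["ChatGPT-4", "Claude 3.5 Sonnet", "Copilot", "Grok 3"]

# Hypothetical rubric scores: two raters x 10 open-ended questions per model.
rater1 = {m: rng.integers(3, 6, size=10) for m in models}
rater2 = {m: rng.integers(3, 6, size=10) for m in models}

# Interrater agreement pooled across models: Cohen's kappa treats the
# scores as categories; Spearman's rho checks rank correlation.
all_r1 = np.concatenate([rater1[m] for m in models])
all_r2 = np.concatenate([rater2[m] for m in models])
print("Cohen's kappa:", cohen_kappa_score(all_r1, all_r2))
print("Spearman's rho:", spearmanr(all_r1, all_r2).statistic)

# Between-model comparison on the raters' mean scores:
# parametric (one-way ANOVA) and non-parametric (Kruskal-Wallis).
mean_scores = [(rater1[m] + rater2[m]) / 2 for m in models]
print("ANOVA p:", f_oneway(*mean_scores).pvalue)
print("Kruskal-Wallis p:", kruskal(*mean_scores).pvalue)
```

In practice, a significant omnibus result would be followed by the post-hoc pairwise tests the abstract mentions, with an appropriate multiple-comparison correction.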
Results: All models demonstrated high MCQ accuracy (88%-94%). Claude 3.5 Sonnet consistently achieved the highest average scores for the open-ended responses, particularly in the clarity dimension, where the difference was statistically significant (p=0.009). Minor between-model differences of 0.1 to 0.6 points in the accuracy, depth, and evidence dimensions were observed but did not reach statistical significance. Despite these differences, all LLMs exhibited strong performance in conveying fluoride-related dental content. Interrater agreement in model rankings was generally strong, supporting the reliability of the comparative outcomes.
Conclusion: Advanced LLMs have substantial potential as supplementary tools for dental education and patient communication regarding fluoride use. Claude 3.5 Sonnet showed a notable advantage in linguistic clarity, highlighting its value in educational contexts. Ongoing evaluation, clinical validation, and oversight are essential to ensure the safe and effective integration of LLMs into dentistry.