{"title":"Reliability of Large Language Model-Based Chatbots Versus Clinicians as Sources of Information on Orthodontics: A Comparative Analysis.","authors":"Stefano Martina, Davide Cannatà, Teresa Paduano, Valentina Schettino, Francesco Giordano, Marzio Galdi","doi":"10.3390/dj13080343","DOIUrl":null,"url":null,"abstract":"<p><p><b>Objectives</b>: The present cross-sectional analysis aimed to investigate whether Large Language Model-based chatbots can be used as reliable sources of information in orthodontics by evaluating chatbot responses and comparing them to those of dental practitioners with different levels of knowledge. <b>Methods</b>: Eight true and false frequently asked orthodontic questions were submitted to five leading chatbots (ChatGPT-4, Claude-3-Opus, Gemini 2.0 Flash Experimental, Microsoft Copilot, and DeepSeek). The consistency of the answers given by chatbots at four different times was assessed using Cronbach's α. Chi-squared test was used to compare chatbot responses with those given by two groups of clinicians, i.e., general dental practitioners (GDPs) and orthodontic specialists (Os) recruited in an online survey via social media, and differences were considered significant when <i>p</i> < 0.05. Additionally, chatbots were asked to provide a justification for their dichotomous responses using a chain-of-through prompting approach and rating the educational value according to the Global Quality Scale (GQS). <b>Results</b>: A high degree of consistency in answering was found for all analyzed chatbots (α > 0.80). When comparing chatbot answers with GDP and O ones, statistically significant differences were found for almost all the questions (<i>p</i> < 0.05). When evaluating the educational value of chatbot responses, DeepSeek achieved the highest GQS score (median 4.00; interquartile range 0.00), whereas CoPilot had the lowest one (median 2.00; interquartile range 2.00). <b>Conclusions</b>: Although chatbots yield somewhat useful information about orthodontics, they can provide misleading information when dealing with controversial topics.</p>","PeriodicalId":11269,"journal":{"name":"Dentistry Journal","volume":"13 8","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12385111/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Dentistry Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/dj13080343","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Abstract
Objectives: This cross-sectional analysis investigated whether Large Language Model-based chatbots can serve as reliable sources of information on orthodontics by evaluating chatbot responses and comparing them with those of dental practitioners with different levels of expertise.

Methods: Eight frequently asked true-or-false questions on orthodontics were submitted to five leading chatbots (ChatGPT-4, Claude-3-Opus, Gemini 2.0 Flash Experimental, Microsoft Copilot, and DeepSeek). The consistency of the answers given by each chatbot at four different time points was assessed using Cronbach's α. A chi-squared test was used to compare chatbot responses with those given by two groups of clinicians, general dental practitioners (GDPs) and orthodontic specialists (Os), recruited through an online survey distributed via social media; differences were considered significant at p < 0.05. Additionally, the chatbots were asked to justify their dichotomous responses using a chain-of-thought prompting approach, and the educational value of these justifications was rated on the Global Quality Scale (GQS).

Results: A high degree of answer consistency was found for all analyzed chatbots (α > 0.80). Statistically significant differences between chatbot answers and those of GDPs and Os were found for almost all questions (p < 0.05). Regarding the educational value of the responses, DeepSeek achieved the highest GQS score (median 4.00; interquartile range 0.00), whereas Copilot had the lowest (median 2.00; interquartile range 2.00).

Conclusions: Although chatbots provide broadly useful information about orthodontics, they can give misleading answers when dealing with controversial topics.
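For readers who want to see how the consistency analysis in the Methods could be computed, below is a minimal sketch of Cronbach's α under one plausible arrangement of the data: the eight questions as rows, the four repeated administrations as columns, and each cell coded 1 when the chatbot's answer matched the reference key. The data matrix itself is hypothetical; the study's raw answers are not reproduced here.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (observations x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of repeated administrations
    item_vars = scores.var(axis=0, ddof=1)       # variance of each administration
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of per-question totals
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

# Hypothetical data: 8 questions (rows) x 4 time points (columns),
# coded 1 when the chatbot's answer matched the answer key.
answers = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])
print(f"alpha = {cronbach_alpha(answers):.2f}")  # values above 0.80 indicate high consistency
```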
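The chatbot-versus-clinician comparison can likewise be illustrated with a chi-squared test on a contingency table, here via scipy.stats.chi2_contingency. The counts below are invented for illustration only (the paper does not publish its raw tables); the layout assumes one 2x2 table per question, with responder groups as rows and the two dichotomous answers as columns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table for a single question:
# rows = responder group, columns = ("True", "False") answers.
observed = np.array([
    [18,  2],   # chatbot runs (5 chatbots x 4 repetitions)
    [55, 45],   # general dental practitioners (GDPs)
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# p < 0.05 would flag a significant difference between chatbot and GDP answers,
# mirroring the comparisons reported in the Results.
```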
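Finally, the GQS summary statistics in the Results (median and interquartile range) are straightforward to compute. The ratings below are hypothetical, chosen only so the output matches the shape of the reported DeepSeek result (median 4.00, IQR 0.00).

```python
import numpy as np

# Hypothetical GQS ratings (1-5 scale) for one chatbot's eight justifications.
gqs = np.array([4, 4, 4, 5, 4, 4, 3, 4])

median = np.median(gqs)
q1, q3 = np.percentile(gqs, [25, 75])
print(f"median = {median:.2f}, IQR = {q3 - q1:.2f}")  # -> median = 4.00, IQR = 0.00
```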