{"title":"Information from digital and human sources: A comparison of chatbot and clinician responses to orthodontic questions.","authors":"Ufuk Metin, Merve Goymen","doi":"10.1016/j.ajodo.2025.04.008","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>This study aimed to investigate whether artificial intelligence (AI)-based chatbots can be used as reliable adjunct tools in orthodontic practice by evaluating chatbot responses and comparing them to those of clinicians with varying levels of knowledge.</p><p><strong>Methods: </strong>Large language model-based chatbots (ChatGPT-4, ChatGPT-4o, Microsoft Copilot, Google Gemini 1.5 Pro, and Claude 3.5 Sonnet) and clinicians (dental students, general dentists, and orthodontists; n = 30) were included. The groups were asked 40 true and false questions, and the accuracy rate for each question was assessed by comparing it to the predetermined answer key. The total score was converted into a percentage. The Kruskal-Wallis test and Dunn's multiple comparison tests were used to compare accuracy rates. The consistency of the answers given by chatbots at 3 different times was assessed by Cronbach α.</p><p><strong>Results: </strong>The accuracy ratio scores for students were significantly lower than Microsoft Copilot (P = 0.029), Claude 3.5 Sonnet (P = 0.023), ChatGPT-4o (P = 0.005), and orthodontists (P = 0.001). For dentists, the accuracy ratio scores were found to be significantly lower than ChatGPT-4o (P = 0.019) and orthodontists (P = 0.001). The accuracy rate of ChatGPT-4o was closest to that of orthodontists, whereas the accuracy rates of ChatGPT-4, Microsoft Copilot, Claude 3.5 Sonnet, and Google Gemini 1.5 Pro were lower than orthodontists but higher than general dentists. Although ChatGPT-4 demonstrated a high degree of consistency in its responses, evidenced by a high Cronbach α value (α = 0.867), ChatGPT-4o (α = 0.256) and Claude 3.5 Sonnet (α = 0.256) were the least consistent chatbots.</p><p><strong>Conclusions: </strong>The study found that orthodontists had the highest accuracy rate, whereas AI-based chatbots had a higher accuracy rate compared with dental students and general dentists. However, ChatGPT-4 gave the most consistent answers, whereas ChatGPT-4o and Claude 3.5 Sonnet showed the least consistency. AI-based chatbots can be useful for patient education and general orthodontic guidance, but a lack of consistency in responses can lead to the risk of misinformation.</p>","PeriodicalId":50806,"journal":{"name":"American Journal of Orthodontics and Dentofacial Orthopedics","volume":" ","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Orthodontics and Dentofacial Orthopedics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.ajodo.2025.04.008","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
引用次数: 0
Abstract
Introduction: This study aimed to investigate whether artificial intelligence (AI)-based chatbots can be used as reliable adjunct tools in orthodontic practice by evaluating chatbot responses and comparing them to those of clinicians with varying levels of knowledge.
Methods: Large language model-based chatbots (ChatGPT-4, ChatGPT-4o, Microsoft Copilot, Google Gemini 1.5 Pro, and Claude 3.5 Sonnet) and clinicians (dental students, general dentists, and orthodontists; n = 30) were included. The groups were asked 40 true-or-false questions, and accuracy on each question was assessed against a predetermined answer key. The total score was converted into a percentage. The Kruskal-Wallis test and Dunn's multiple comparison test were used to compare accuracy rates. The consistency of the answers given by each chatbot at 3 different times was assessed by Cronbach α.
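For readers who want to reproduce this style of analysis, the following is a minimal Python sketch of the pipeline the abstract describes (Kruskal-Wallis test, Dunn's post-hoc comparisons, and Cronbach α), assuming the scipy and scikit-posthocs packages are available; all data, group sizes, and group names below are simulated placeholders, not the study's actual responses.

```python
import numpy as np
from scipy.stats import kruskal
import scikit_posthocs as sp

rng = np.random.default_rng(0)

# Per-respondent accuracy percentages over the 40 true-or-false items,
# one array per group (simulated placeholder values, not study data).
groups = {
    "students":      rng.uniform(50, 75, 10),
    "dentists":      rng.uniform(60, 85, 10),
    "orthodontists": rng.uniform(80, 100, 10),
    "chatbot":       rng.uniform(75, 95, 3),   # 3 repeated query sessions
}

# Kruskal-Wallis test across all groups, as in the abstract.
h, p = kruskal(*groups.values())
print(f"Kruskal-Wallis H = {h:.2f}, P = {p:.4f}")

# Dunn's post-hoc pairwise comparisons (scikit-posthocs package).
dunn_p = sp.posthoc_dunn(list(groups.values()), p_adjust="bonferroni")
print(dunn_p)

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (questions x occasions) score matrix:
    alpha = k/(k-1) * (1 - sum of per-occasion variances / variance of row totals)."""
    k = scores.shape[1]
    occasion_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - occasion_var / total_var)

# 40 questions x 3 query sessions of 0/1 correctness for one chatbot.
scores = rng.integers(0, 2, size=(40, 3))
print(f"Cronbach alpha = {cronbach_alpha(scores):.3f}")
```

In this sketch, Cronbach α treats the 3 query sessions as the "items" scored over the 40 questions, which is one common way to quantify the repeat-response consistency the study reports; the paper's exact scoring matrix is not public.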
Results: The accuracy scores of students were significantly lower than those of Microsoft Copilot (P = 0.029), Claude 3.5 Sonnet (P = 0.023), ChatGPT-4o (P = 0.005), and orthodontists (P = 0.001). The accuracy scores of general dentists were significantly lower than those of ChatGPT-4o (P = 0.019) and orthodontists (P = 0.001). The accuracy rate of ChatGPT-4o was closest to that of orthodontists, whereas the accuracy rates of ChatGPT-4, Microsoft Copilot, Claude 3.5 Sonnet, and Google Gemini 1.5 Pro were lower than that of orthodontists but higher than that of general dentists. Whereas ChatGPT-4 demonstrated a high degree of consistency in its responses (Cronbach α = 0.867), ChatGPT-4o (α = 0.256) and Claude 3.5 Sonnet (α = 0.256) were the least consistent chatbots.
Conclusions: Orthodontists had the highest accuracy rate, and AI-based chatbots achieved higher accuracy rates than dental students and general dentists. However, ChatGPT-4 gave the most consistent answers, whereas ChatGPT-4o and Claude 3.5 Sonnet were the least consistent. AI-based chatbots can be useful for patient education and general orthodontic guidance, but inconsistent responses carry a risk of misinformation.
About the journal:
Published for more than 100 years, the American Journal of Orthodontics and Dentofacial Orthopedics remains the leading orthodontic resource. It is the official publication of the American Association of Orthodontists, its constituent societies, the American Board of Orthodontics, and the College of Diplomates of the American Board of Orthodontics. Each month its readers have access to original peer-reviewed articles that examine all phases of orthodontic treatment. Illustrated throughout, the publication includes tables, color photographs, and statistical data. Coverage includes successful diagnostic procedures, imaging techniques, bracket and archwire materials, extraction and impaction concerns, orthognathic surgery, TMJ disorders, removable appliances, and adult therapy.