Reliability of Large Language Model-Based Chatbots Versus Clinicians as Sources of Information on Orthodontics: A Comparative Analysis

IF 3.1 Q2 DENTISTRY, ORAL SURGERY & MEDICINE
Stefano Martina, Davide Cannatà, Teresa Paduano, Valentina Schettino, Francesco Giordano, Marzio Galdi
{"title":"基于大型语言模型的聊天机器人与临床医生作为正畸信息来源的可靠性:比较分析。","authors":"Stefano Martina, Davide Cannatà, Teresa Paduano, Valentina Schettino, Francesco Giordano, Marzio Galdi","doi":"10.3390/dj13080343","DOIUrl":null,"url":null,"abstract":"<p><p><b>Objectives</b>: The present cross-sectional analysis aimed to investigate whether Large Language Model-based chatbots can be used as reliable sources of information in orthodontics by evaluating chatbot responses and comparing them to those of dental practitioners with different levels of knowledge. <b>Methods</b>: Eight true and false frequently asked orthodontic questions were submitted to five leading chatbots (ChatGPT-4, Claude-3-Opus, Gemini 2.0 Flash Experimental, Microsoft Copilot, and DeepSeek). The consistency of the answers given by chatbots at four different times was assessed using Cronbach's α. Chi-squared test was used to compare chatbot responses with those given by two groups of clinicians, i.e., general dental practitioners (GDPs) and orthodontic specialists (Os) recruited in an online survey via social media, and differences were considered significant when <i>p</i> < 0.05. Additionally, chatbots were asked to provide a justification for their dichotomous responses using a chain-of-through prompting approach and rating the educational value according to the Global Quality Scale (GQS). <b>Results</b>: A high degree of consistency in answering was found for all analyzed chatbots (α > 0.80). When comparing chatbot answers with GDP and O ones, statistically significant differences were found for almost all the questions (<i>p</i> < 0.05). When evaluating the educational value of chatbot responses, DeepSeek achieved the highest GQS score (median 4.00; interquartile range 0.00), whereas CoPilot had the lowest one (median 2.00; interquartile range 2.00). <b>Conclusions</b>: Although chatbots yield somewhat useful information about orthodontics, they can provide misleading information when dealing with controversial topics.</p>","PeriodicalId":11269,"journal":{"name":"Dentistry Journal","volume":"13 8","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12385111/pdf/","citationCount":"0","resultStr":"{\"title\":\"Reliability of Large Language Model-Based Chatbots Versus Clinicians as Sources of Information on Orthodontics: A Comparative Analysis.\",\"authors\":\"Stefano Martina, Davide Cannatà, Teresa Paduano, Valentina Schettino, Francesco Giordano, Marzio Galdi\",\"doi\":\"10.3390/dj13080343\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Objectives</b>: The present cross-sectional analysis aimed to investigate whether Large Language Model-based chatbots can be used as reliable sources of information in orthodontics by evaluating chatbot responses and comparing them to those of dental practitioners with different levels of knowledge. <b>Methods</b>: Eight true and false frequently asked orthodontic questions were submitted to five leading chatbots (ChatGPT-4, Claude-3-Opus, Gemini 2.0 Flash Experimental, Microsoft Copilot, and DeepSeek). The consistency of the answers given by chatbots at four different times was assessed using Cronbach's α. 
Chi-squared test was used to compare chatbot responses with those given by two groups of clinicians, i.e., general dental practitioners (GDPs) and orthodontic specialists (Os) recruited in an online survey via social media, and differences were considered significant when <i>p</i> < 0.05. Additionally, chatbots were asked to provide a justification for their dichotomous responses using a chain-of-through prompting approach and rating the educational value according to the Global Quality Scale (GQS). <b>Results</b>: A high degree of consistency in answering was found for all analyzed chatbots (α > 0.80). When comparing chatbot answers with GDP and O ones, statistically significant differences were found for almost all the questions (<i>p</i> < 0.05). When evaluating the educational value of chatbot responses, DeepSeek achieved the highest GQS score (median 4.00; interquartile range 0.00), whereas CoPilot had the lowest one (median 2.00; interquartile range 2.00). <b>Conclusions</b>: Although chatbots yield somewhat useful information about orthodontics, they can provide misleading information when dealing with controversial topics.</p>\",\"PeriodicalId\":11269,\"journal\":{\"name\":\"Dentistry Journal\",\"volume\":\"13 8\",\"pages\":\"\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-07-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12385111/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Dentistry Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/dj13080343\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Dentistry Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/dj13080343","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract



Objectives: This cross-sectional analysis aimed to investigate whether Large Language Model-based chatbots can serve as reliable sources of information in orthodontics by evaluating chatbot responses and comparing them with those of dental practitioners with different levels of knowledge. Methods: Eight frequently asked true-or-false orthodontic questions were submitted to five leading chatbots (ChatGPT-4, Claude-3-Opus, Gemini 2.0 Flash Experimental, Microsoft Copilot, and DeepSeek). The consistency of the answers given by each chatbot at four different times was assessed using Cronbach's α. A chi-squared test was used to compare chatbot responses with those given by two groups of clinicians, general dental practitioners (GDPs) and orthodontic specialists (Os), recruited through an online survey via social media; differences were considered significant when p < 0.05. Additionally, the chatbots were asked to justify their dichotomous responses using a chain-of-thought prompting approach, and the educational value of each justification was rated on the Global Quality Scale (GQS). Results: A high degree of consistency in answering was found for all analyzed chatbots (α > 0.80). When chatbot answers were compared with those of GDPs and Os, statistically significant differences were found for almost all questions (p < 0.05). In the evaluation of educational value, DeepSeek achieved the highest GQS score (median 4.00; interquartile range 0.00), whereas Copilot had the lowest (median 2.00; interquartile range 2.00). Conclusions: Although chatbots yield somewhat useful information about orthodontics, they can provide misleading information when dealing with controversial topics.
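The Methods rest on two standard statistics: Cronbach's α to check whether a chatbot gives the same true/false answers across four repeated runs, and a chi-squared test to compare answer distributions between respondent groups. The Python sketch below is not the authors' code; all answer matrices and group counts are hypothetical placeholders, included only to show how these two quantities could be computed.

```python
# Minimal sketch of the two statistics named in the Methods: Cronbach's alpha
# over repeated chatbot runs, and a chi-squared test between respondent groups.
# Not the authors' code; all data below are hypothetical placeholders.
import numpy as np
from scipy.stats import chi2_contingency


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (n_questions, n_runs) matrix of 0/1 answers.

    Each column is one administration (here: one of the four times a
    chatbot answered the same eight questions).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                        # number of repeated runs
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)


# Hypothetical: one chatbot's true(1)/false(0) answers to 8 questions,
# collected at 4 different times.
runs = np.array([
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
])
print(f"Cronbach's alpha: {cronbach_alpha(runs):.2f}")  # > 0.80 reads as consistent

# Hypothetical 2x2 table for one question: rows = respondent group,
# columns = counts of "true" / "false" answers.
table = np.array([
    [34, 6],    # e.g., orthodontic specialists (Os)
    [18, 22],   # e.g., general dental practitioners (GDPs)
])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # significant if p < 0.05
```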

Source journal: Dentistry Journal (Dentistry, all)
CiteScore: 3.70
Self-citation rate: 7.70%
Publications: 213
Review time: 11 weeks