Accuracy and Reliability of Artificial Intelligence Chatbots as Public Information Sources in Implant Dentistry.

Filiz Yagcı, Ravza Eraslan, Haydar Albayrak, Funda İpekten
{"title":"Accuracy and Reliability of Artificial Intelligence Chatbots as Public Information Sources in Implant Dentistry.","authors":"Filiz Yagcı, Ravza Eraslan, Haydar Albayrak, Funda İpekten","doi":"10.11607/jomi.11280","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>The purpose of this study was to evaluate the accuracy, completeness, comprehensibility and reliability of widely available AI chatbots in addressing clinically significant queries pertaining to implant dentistry.</p><p><strong>Materials and methods: </strong>Twenty questions were devised based on those that were most frequently asked or encountered during patient consultations by three experienced prosthodontists. That questions were asked to ChatGPT- 3.5, Gemini, Copilot AI chatbots. All questions were asked to the each chatbot three times with a twelve days intervals and a three-point Likert scale (Grade 0: incorrect, grade 1: incomplete or partially correct, and grade 2: correct) and a two point scale (true and false) were employed by the authors to grade the accuracy of the responses independently. Also completeness and comprehensibility were evaluated using a three-point Likert scale. Frequently asked five questions to each chatbot were analyzed. The comparison of total scores of the chatbots was made with one-way analysis of variance. Two point scale data were analysed by Chi-Square test. The reliability of the responses for each chatbot was analyzed by assessing the consistency of repeated responses by calculating Cronbach's alpha coefficients.</p><p><strong>Results: </strong>When the total scores of the chatbots were analyzed (ChatGPT-3.5 = 28.78 ± 4.06, Gemini = 30.89 ± 4.08, Copilot = 29.11 ± 3.22), one-way ANOVA revealed no statistically significant differences (P=.461). Evaluation of two-point scale data which analysed by Chi-Square test, revealed no statistical difference among the chatbots (P=.336). Gemini has shown higher completeness level than ChatGPT-3.5 (P=.011). There was no statistically significant difference among AI chatbots in terms of comprehensibility. Copilot demonstrated the greatest overall consistency among the three chatbots, with a Cronbach's alpha value of 0.863. This was followed by ChatGPT-3.5 with a Cronbach's alpha value of 0.779 and Gemini with a Cronbach's alpha value of 0.636.</p><p><strong>Conclusions: </strong>The accuracy of three chatbots was found similar. All three chatbots demonstrated an acceptable level of consistency. However, given the low accuracy rate of chatbots in answering questions, it is clear that they should not be the sole decision-maker. The clinician's opinion must be given priority.</p>","PeriodicalId":94230,"journal":{"name":"The International journal of oral & maxillofacial implants","volume":"0 0","pages":"1-23"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International journal of oral & maxillofacial implants","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.11607/jomi.11280","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose: The purpose of this study was to evaluate the accuracy, completeness, comprehensibility and reliability of widely available AI chatbots in addressing clinically significant queries pertaining to implant dentistry.

Materials and methods: Twenty questions were devised by three experienced prosthodontists, based on those most frequently asked or encountered during patient consultations. These questions were posed to the ChatGPT-3.5, Gemini, and Copilot AI chatbots. Each question was asked of each chatbot three times, at twelve-day intervals, and the authors independently graded the accuracy of the responses using a three-point Likert scale (grade 0: incorrect; grade 1: incomplete or partially correct; grade 2: correct) and a two-point scale (true/false). Completeness and comprehensibility were also evaluated using a three-point Likert scale. Five frequently asked questions posed to each chatbot were analyzed. The chatbots' total scores were compared with one-way analysis of variance (ANOVA), and the two-point scale data were analyzed with the chi-square test. The reliability of each chatbot's responses was assessed by evaluating the consistency of repeated responses using Cronbach's alpha coefficients.
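The statistical workflow described above can be illustrated with a short Python sketch. This is not the authors' analysis script; the score matrices and true/false counts below are hypothetical placeholders used only to show the shape of the computations (one-way ANOVA, chi-square test, and Cronbach's alpha over repeated responses).

```python
# Illustrative sketch of the described analysis with made-up data.
import numpy as np
from scipy import stats

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_questions x n_repetitions) score matrix."""
    k = scores.shape[1]                        # number of repeated askings (here: 3)
    item_vars = scores.var(axis=0, ddof=1)     # variance of each repetition across questions
    total_var = scores.sum(axis=1).var(ddof=1) # variance of per-question total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical three-point Likert accuracy scores (0/1/2), 20 questions x 3 repetitions.
rng = np.random.default_rng(0)
chatgpt = rng.integers(0, 3, size=(20, 3))
gemini = rng.integers(0, 3, size=(20, 3))
copilot = rng.integers(0, 3, size=(20, 3))

# One-way ANOVA comparing the chatbots' per-question total accuracy scores.
f_stat, p_anova = stats.f_oneway(
    chatgpt.sum(axis=1), gemini.sum(axis=1), copilot.sum(axis=1)
)

# Chi-square test on hypothetical two-point (true/false) counts, one row per chatbot.
true_false = np.array([[45, 15], [48, 12], [46, 14]])
chi2, p_chi, dof, _ = stats.chi2_contingency(true_false)

# Consistency of repeated responses for each chatbot.
for name, scores in [("ChatGPT-3.5", chatgpt), ("Gemini", gemini), ("Copilot", copilot)]:
    print(f"{name}: Cronbach's alpha = {cronbach_alpha(scores):.3f}")
print(f"ANOVA P = {p_anova:.3f}, chi-square P = {p_chi:.3f}")
```

In this framing, each repeated asking of the question set is treated as an "item" and each question as a "case", so Cronbach's alpha reflects how consistently a chatbot scored across the three askings; values around 0.7 or above are commonly read as acceptable internal consistency.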

Results: When the total scores of the chatbots were analyzed (ChatGPT-3.5 = 28.78 ± 4.06, Gemini = 30.89 ± 4.08, Copilot = 29.11 ± 3.22), one-way ANOVA revealed no statistically significant differences (P=.461). Evaluation of the two-point scale data with the chi-square test likewise revealed no statistically significant difference among the chatbots (P=.336). Gemini showed a higher completeness level than ChatGPT-3.5 (P=.011). There was no statistically significant difference among the AI chatbots in terms of comprehensibility. Copilot demonstrated the greatest overall consistency of the three chatbots, with a Cronbach's alpha of 0.863, followed by ChatGPT-3.5 (0.779) and Gemini (0.636).

Conclusions: The accuracy of the three chatbots was similar, and all three demonstrated an acceptable level of consistency. However, given the chatbots' low accuracy rate in answering questions, they should not be the sole decision-maker; the clinician's opinion must be given priority.
