Evaluation of information provided by artificial intelligence chatbots on extraoral maxillofacial prostheses
Nuran Özyemişci, Bilge Turhan Bal, Merve Bankoğlu Güngör, Esra Kaynak Öztürk, Ayşegül Canvar, Secil Karakoca Nemli
Journal of Prosthetic Dentistry, published online September 8, 2025. DOI: 10.1016/j.prosdent.2025.08.028
Abstract
Statement of problem: Despite advances in artificial intelligence (AI), the quality, reliability, and understandability of health-related information provided by chatbots remain uncertain. Furthermore, studies on maxillofacial prosthesis (MP) information provided by AI chatbots are lacking.
Purpose: The purpose of this study was to assess and compare the reliability, quality, readability, and similarity of responses to MP-related questions generated by 4 different chatbots.
Material and methods: A total of 15 questions were prepared by a maxillofacial prosthodontist, and responses were obtained from 4 different chatbots (ChatGPT-3.5, Gemini 2.5 Flash, Copilot, and DeepSeek V3). The Reliability Scoring (adapted DISCERN), the Global Quality Scale (GQS), the Flesch Reading Ease Score (FRES), the Flesch-Kincaid Reading Grade Level (FKRGL), and the Similarity Index (iThenticate) were used to evaluate the performance of the chatbots. Data were compared using the Kruskal-Wallis test, and differences between chatbots were determined by the Conover multiple comparison test with Benjamini-Hochberg correction (α=.05).
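For readers unfamiliar with these instruments, the sketch below illustrates the two standard Flesch readability formulas and the statistical workflow the methods describe. It is a hypothetical Python reconstruction, not the authors' analysis code: the per-question GQS scores are randomly generated placeholders, and scipy/scikit-posthocs stand in for whatever statistical software the study actually used. In practice, the word, sentence, and syllable counts would be extracted from each chatbot response (libraries such as textstat implement both formulas directly).

import numpy as np
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp

def flesch_reading_ease(words, sentences, syllables):
    """FRES = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier text; below about 60 is hard for lay readers."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """FKRGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59.
    Approximates the U.S. school grade needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical GQS scores (1-5 scale): 15 questions per chatbot.
rng = np.random.default_rng(seed=1)
data = pd.DataFrame({
    "chatbot": np.repeat(["ChatGPT-3.5", "Gemini 2.5 Flash",
                          "Copilot", "DeepSeek V3"], 15),
    "gqs": rng.integers(3, 6, size=60),
})

# Omnibus Kruskal-Wallis test across the 4 chatbots (alpha = .05).
groups = [g["gqs"].to_numpy() for _, g in data.groupby("chatbot")]
h_stat, p_value = kruskal(*groups)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, P = {p_value:.3f}")

# Pairwise Conover tests with Benjamini-Hochberg (FDR) correction,
# as named in the methods.
pairwise = sp.posthoc_conover(data, val_col="gqs", group_col="chatbot",
                              p_adjust="fdr_bh")
print(pairwise)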
Results: There were no significant differences among the chatbots' DISCERN scores, except for one question on which ChatGPT showed significantly higher reliability than Gemini or Copilot (P=.03). There was no statistically significant difference among the AI tools in GQS values (P=.096), FRES values (P=.166), or FKRGL values (P=.247). The similarity rate of Gemini was significantly higher than that of the other AI chatbots (P=.03).
Conclusions: ChatGPT-3.5, Gemini 2.5 Flash, Copilot, and DeepSeek V3 provided good-quality responses. All chatbots' responses were difficult for non-professionals to read and understand. Low similarity rates were found for all chatbots except Gemini, indicating the originality of their information.
About the journal:
The Journal of Prosthetic Dentistry is the leading professional journal devoted exclusively to prosthetic and restorative dentistry. The Journal is the official publication of 24 leading U.S. and international prosthodontic organizations. The monthly publication features timely, original peer-reviewed articles on the newest techniques, dental materials, and research findings. The Journal serves prosthodontists and dentists in advanced practice, and features color photographs that illustrate many step-by-step procedures. The Journal of Prosthetic Dentistry is included in Index Medicus and CINAHL.