Mohamed G Hassan, Ahmed A Abdelaziz, Hams H Abdelrahman, Mostafa M Y Mohamed, Mohamed T Ellabban
{"title":"Performance of AI-Chatbots to Common Temporomandibular Joint Disorders (TMDs) Patient Queries: Accuracy, Completeness, Reliability and Readability.","authors":"Mohamed G Hassan, Ahmed A Abdelaziz, Hams H Abdelrahman, Mostafa M Y Mohamed, Mohamed T Ellabban","doi":"10.1111/ocr.12939","DOIUrl":null,"url":null,"abstract":"<p><p>TMDs are a common group of conditions affecting the temporomandibular joint (TMJ) often resulting from factors like injury, stress or teeth grinding. This study aimed to evaluate the accuracy, completeness, reliability and readability of the responses generated by ChatGPT-3.5, -4o and Google Gemini to TMD-related inquiries. Forty-five questions covering various aspects of TMDs were created by two experts and submitted by one author to ChatGPT-3.5, ChatGPT-4 and Google Gemini on the same day. The responses were evaluated for accuracy, completeness and reliability using modified Likert scales. Readability was analysed with six validated indices via a specialised tool. Additional features, such as the inclusion of graphical elements, references and safeguard mechanisms, were also documented and analysed. The Pearson Chi-Square and One-Way ANOVA tests were used for data analysis. Google Gemini achieved the highest accuracy, providing 100% correct responses, followed by ChatGPT-3.5 (95.6%) and ChatGPT-4o (93.3%). ChatGPT-4o provided the most complete responses (91.1%), followed by ChatGPT-03 (64.4%) and Google Gemini (42.2%). The majority of responses were reliable, with ChatGPT-4o at 93.3% 'Absolutely Reliable', compared to 46.7% for ChatGPT-3.5 and 48.9% for Google Gemini. Both ChatGPT-4o and Google Gemini included references in responses, 22.2% and 13.3%, respectively, while ChatGPT-3.5 included none. Google Gemini was the only model that included multimedia (6.7%). Readability scores were highest for ChatGPT-3.5, suggesting its responses were more complex than those of Google Gemini and ChatGPT-4o. 
Both ChatGPT-4o and Google Gemini demonstrated accuracy and reliability in addressing TMD-related questions, with their responses being clear, easy to understand and complemented by safeguard statements encouraging specialist consultation. However, both platforms lacked evidence-based references. Only Google Gemini incorporated multimedia elements into its answers.</p>","PeriodicalId":19652,"journal":{"name":"Orthodontics & Craniofacial Research","volume":" ","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Orthodontics & Craniofacial Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1111/ocr.12939","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0
Abstract
TMDs are a common group of conditions affecting the temporomandibular joint (TMJ), often resulting from factors such as injury, stress or teeth grinding. This study aimed to evaluate the accuracy, completeness, reliability and readability of the responses generated by ChatGPT-3.5, ChatGPT-4o and Google Gemini to TMD-related inquiries. Forty-five questions covering various aspects of TMDs were created by two experts and submitted by one author to ChatGPT-3.5, ChatGPT-4o and Google Gemini on the same day. The responses were evaluated for accuracy, completeness and reliability using modified Likert scales. Readability was analysed with six validated indices via a specialised tool. Additional features, such as the inclusion of graphical elements, references and safeguard mechanisms, were also documented and analysed. The Pearson Chi-Square and One-Way ANOVA tests were used for data analysis. Google Gemini achieved the highest accuracy, providing 100% correct responses, followed by ChatGPT-3.5 (95.6%) and ChatGPT-4o (93.3%). ChatGPT-4o provided the most complete responses (91.1%), followed by ChatGPT-3.5 (64.4%) and Google Gemini (42.2%). The majority of responses were reliable: 93.3% of ChatGPT-4o responses were rated 'Absolutely Reliable', compared with 46.7% for ChatGPT-3.5 and 48.9% for Google Gemini. Both ChatGPT-4o and Google Gemini included references in some responses (22.2% and 13.3%, respectively), while ChatGPT-3.5 included none. Google Gemini was the only model that included multimedia (6.7%). Readability index scores were highest for ChatGPT-3.5, indicating that its responses were the most complex to read. Both ChatGPT-4o and Google Gemini demonstrated accuracy and reliability in addressing TMD-related questions; their responses were clear, easy to understand and complemented by safeguard statements encouraging specialist consultation. However, both platforms lacked evidence-based references, and only Google Gemini incorporated multimedia elements into its answers.
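The readability analysis used six validated indices via a specialised tool; the abstract does not name them or the tool. As a hedged illustration only (not the authors' actual pipeline), one widely used index of this kind, the Flesch Reading Ease score, can be sketched in pure Python with a naive vowel-group syllable heuristic. Higher scores mean easier text:

```python
import re


def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per run of consecutive vowels (incl. y).
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


# Short, plain sentences score higher (easier) than long, polysyllabic ones.
simple = "The jaw can hurt. Rest may help."
dense = "Temporomandibular dysfunction frequently necessitates multidisciplinary therapeutic intervention."
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Under this reading, ChatGPT-3.5's "highest" scores on complexity-oriented indices (such as grade-level formulas) would correspond to harder text; production readability tools use carefully validated syllable counting rather than the naive heuristic above.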
Journal Introduction:
Orthodontics & Craniofacial Research - Genes, Growth and Development is published to serve its readers as an international forum for the presentation and critical discussion of issues pertinent to the advancement of the specialty of orthodontics and the evidence-based knowledge of craniofacial growth and development. This forum is based on scientifically supported information, but also includes minority and conflicting opinions.
The objective of the journal is to facilitate effective communication between the research community and practicing clinicians. Original papers of high scientific quality that report the findings of clinical trials, clinical epidemiology, and novel therapeutic or diagnostic approaches are appropriate submissions. Similarly, we welcome papers in genetics, developmental biology, syndromology, surgery, speech and hearing, and other biomedical disciplines related to clinical orthodontics and normal and abnormal craniofacial growth and development. In addition to original and basic research, the journal publishes concise reviews, case reports of substantial value, invited essays, letters, and announcements.
The journal is published quarterly. Review of submitted papers is coordinated by the editor and members of the editorial board. It is journal policy to review manuscripts within 3 to 4 weeks of receipt and to publish accepted papers within 3 to 6 months of acceptance.