Mohamed G Hassan, Ahmed A Abdelaziz, Hams H Abdelrahman, Mostafa M Y Mohamed, Mohamed T Ellabban
{"title":"Performance of AI-Chatbots to Common Temporomandibular Joint Disorders (TMDs) Patient Queries: Accuracy, Completeness, Reliability and Readability.","authors":"Mohamed G Hassan, Ahmed A Abdelaziz, Hams H Abdelrahman, Mostafa M Y Mohamed, Mohamed T Ellabban","doi":"10.1111/ocr.12939","DOIUrl":null,"url":null,"abstract":"<p><p>TMDs are a common group of conditions affecting the temporomandibular joint (TMJ) often resulting from factors like injury, stress or teeth grinding. This study aimed to evaluate the accuracy, completeness, reliability and readability of the responses generated by ChatGPT-3.5, -4o and Google Gemini to TMD-related inquiries. Forty-five questions covering various aspects of TMDs were created by two experts and submitted by one author to ChatGPT-3.5, ChatGPT-4 and Google Gemini on the same day. The responses were evaluated for accuracy, completeness and reliability using modified Likert scales. Readability was analysed with six validated indices via a specialised tool. Additional features, such as the inclusion of graphical elements, references and safeguard mechanisms, were also documented and analysed. The Pearson Chi-Square and One-Way ANOVA tests were used for data analysis. Google Gemini achieved the highest accuracy, providing 100% correct responses, followed by ChatGPT-3.5 (95.6%) and ChatGPT-4o (93.3%). ChatGPT-4o provided the most complete responses (91.1%), followed by ChatGPT-03 (64.4%) and Google Gemini (42.2%). The majority of responses were reliable, with ChatGPT-4o at 93.3% 'Absolutely Reliable', compared to 46.7% for ChatGPT-3.5 and 48.9% for Google Gemini. Both ChatGPT-4o and Google Gemini included references in responses, 22.2% and 13.3%, respectively, while ChatGPT-3.5 included none. Google Gemini was the only model that included multimedia (6.7%). Readability scores were highest for ChatGPT-3.5, suggesting its responses were more complex than those of Google Gemini and ChatGPT-4o. 
Both ChatGPT-4o and Google Gemini demonstrated accuracy and reliability in addressing TMD-related questions, with their responses being clear, easy to understand and complemented by safeguard statements encouraging specialist consultation. However, both platforms lacked evidence-based references. Only Google Gemini incorporated multimedia elements into its answers.</p>","PeriodicalId":19652,"journal":{"name":"Orthodontics & Craniofacial Research","volume":" ","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Orthodontics & Craniofacial Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1111/ocr.12939","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0
Abstract
TMDs are a common group of conditions affecting the temporomandibular joint (TMJ), often resulting from factors such as injury, stress or teeth grinding. This study aimed to evaluate the accuracy, completeness, reliability and readability of the responses generated by ChatGPT-3.5, ChatGPT-4o and Google Gemini to TMD-related inquiries. Forty-five questions covering various aspects of TMDs were created by two experts and submitted by one author to ChatGPT-3.5, ChatGPT-4o and Google Gemini on the same day. The responses were evaluated for accuracy, completeness and reliability using modified Likert scales. Readability was analysed with six validated indices via a specialised tool. Additional features, such as the inclusion of graphical elements, references and safeguard mechanisms, were also documented and analysed. The Pearson Chi-Square and One-Way ANOVA tests were used for data analysis. Google Gemini achieved the highest accuracy, providing 100% correct responses, followed by ChatGPT-3.5 (95.6%) and ChatGPT-4o (93.3%). ChatGPT-4o provided the most complete responses (91.1%), followed by ChatGPT-3.5 (64.4%) and Google Gemini (42.2%). The majority of responses were reliable: 93.3% of ChatGPT-4o responses were rated 'Absolutely Reliable', compared with 46.7% for ChatGPT-3.5 and 48.9% for Google Gemini. Both ChatGPT-4o and Google Gemini included references in some responses (22.2% and 13.3%, respectively), while ChatGPT-3.5 included none. Google Gemini was the only model that included multimedia (6.7%). Readability index scores were highest for ChatGPT-3.5, indicating that its responses were the most complex to read. Both ChatGPT-4o and Google Gemini demonstrated accuracy and reliability in addressing TMD-related questions; their responses were clear, easy to understand and complemented by safeguard statements encouraging specialist consultation. However, both platforms lacked evidence-based references, and only Google Gemini incorporated multimedia elements into its answers.
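The readability analysis used six validated indices via a specialised tool; the abstract does not name them or the tool. As a hedged illustration only (not the authors' actual pipeline), one widely used index of this kind, the Flesch Reading Ease score, can be sketched in pure Python with a naive vowel-group syllable heuristic. Higher scores mean easier text:

```python
import re


def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per run of consecutive vowels (incl. y).
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


# Short, plain sentences score higher (easier) than long, polysyllabic ones.
simple = "The jaw can hurt. Rest may help."
dense = "Temporomandibular dysfunction frequently necessitates multidisciplinary therapeutic intervention."
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Under this reading, ChatGPT-3.5's "highest" scores on complexity-oriented indices (such as grade-level formulas) would correspond to harder text; production readability tools use carefully validated syllable counting rather than the naive heuristic above.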
Journal Introduction:
Orthodontics & Craniofacial Research - Genes, Growth and Development is published to serve its readers as an international forum for the presentation and critical discussion of issues pertinent to the advancement of the specialty of orthodontics and the evidence-based knowledge of craniofacial growth and development. This forum is based on scientifically supported information, but also includes minority and conflicting opinions.
The objective of the journal is to facilitate effective communication between the research community and practicing clinicians. Original papers of high scientific quality that report the findings of clinical trials, clinical epidemiology, and novel therapeutic or diagnostic approaches are appropriate submissions. Similarly, we welcome papers in genetics, developmental biology, syndromology, surgery, speech and hearing, and other biomedical disciplines related to clinical orthodontics and normal and abnormal craniofacial growth and development. In addition to original and basic research, the journal publishes concise reviews, case reports of substantial value, invited essays, letters, and announcements.
The journal is published quarterly. Review of submitted papers is coordinated by the editor and members of the editorial board. It is journal policy to review manuscripts within 3 to 4 weeks of receipt and to publish accepted papers within 3 to 6 months of acceptance.