{"title":"Can Artificial Intelligence Language Models Effectively Address Dental Trauma Questions?","authors":"Hasibe Elif Kuru, Aslı Aşık, Doğukan Mert Demir","doi":"10.1111/edt.13063","DOIUrl":null,"url":null,"abstract":"<p><strong>Background/aim: </strong>Artificial intelligence (AI) chatbots, also known as large language models (LLMs), have become increasingly common educational tools in healthcare. Although the use of LLMs for emergency dental trauma is gaining popularity, it is crucial to assess their reliability. This study aimed to compare the reliabilities of different LLMs in response to multiple questions related to dental trauma.</p><p><strong>Materials and methods: </strong>In a cross-sectional observational study conducted in October 2024, 30 questions (10 multiple-choice, 10 fill-in-the-blank, and 10 dichotomous) based on the International Association of Dental Traumatology guidelines were posed to five LLMs: ChatGPT 4, ChatGPT 3.5, Copilot Free version (Copilot F), Copilot Pro (Copilot P), and Google Gemini over nine consecutive days. Responses of each model (1350 in total) were recorded in binary format and analyzed using Pearson's chi-square and Fisher's exact tests to assess correctness and consistency (p < 0.05).</p><p><strong>Results: </strong>The answers provided by the LLMs to repeated questions on consecutive days showed a high degree of repeatability. Although there was no statistically significant difference in the success rate of providing correct answers among the LLMs (p > 0.05), the rankings based on the rate of successful answers were as follows: ChatGPT 3.5 (76.7%) > Copilot P (73.3%) > Copilot F (70%) > ChatGPT 4 (63.3%) > Gemini (46.7%). ChatGPT 3.5, ChatGPT 4, and Gemini showed a significantly higher correct response rate for multiple choice and fill in the blank questions compared to their performance on dichotomous (true/false) questions (p < 0.05). Conversely, The Copilot models did not exhibit significant differences across question types. Notably, the explanations provided by Copilot and Gemini were often inaccurate, and Copilot's cited references had low evidential value.</p><p><strong>Conclusions: </strong>While LLMs show potential as adjunct educational tools in dental traumatology, their variable accuracy and inclusion of unreliable references call for careful integration strategies.</p>","PeriodicalId":55180,"journal":{"name":"Dental Traumatology","volume":" ","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Dental Traumatology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1111/edt.13063","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0
Abstract
Background/aim: Artificial intelligence (AI) chatbots, also known as large language models (LLMs), have become increasingly common educational tools in healthcare. Although the use of LLMs for emergency dental trauma guidance is gaining popularity, it is crucial to assess their reliability. This study aimed to compare the reliability of different LLMs in answering questions related to dental trauma.
Materials and methods: In a cross-sectional observational study conducted in October 2024, 30 questions (10 multiple-choice, 10 fill-in-the-blank, and 10 dichotomous) based on the International Association of Dental Traumatology guidelines were posed to five LLMs: ChatGPT 4, ChatGPT 3.5, Copilot Free version (Copilot F), Copilot Pro (Copilot P), and Google Gemini, over nine consecutive days. All responses (5 models × 30 questions × 9 days = 1,350 in total) were recorded in binary format (correct/incorrect) and analyzed using Pearson's chi-square and Fisher's exact tests to assess correctness and consistency (p < 0.05).
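The abstract does not include the authors' analysis code; the following is a minimal Python sketch of how binary correctness counts for two models can be compared with Pearson's chi-square test, falling back to Fisher's exact test when expected cell counts are small. The counts used here are derived from the reported success rates over 30 questions (76.7% for ChatGPT 3.5, 46.7% for Gemini); the small-count threshold and the fallback rule are assumptions for illustration, not details stated in the paper.

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# 2x2 contingency table of correct vs. incorrect answers for two models.
# Rows: ChatGPT 3.5 (23/30 correct = 76.7%), Gemini (14/30 correct = 46.7%).
table = np.array([[23, 7],
                  [14, 16]])

# Pearson's chi-square test of independence (Yates-corrected for 2x2).
chi2, p, dof, expected = chi2_contingency(table)

# Assumed convention: if any expected cell count is below 5,
# use Fisher's exact test instead.
if (expected < 5).any():
    _, p = fisher_exact(table)

print(f"p = {p:.3f} -> {'significant' if p < 0.05 else 'not significant'} at alpha = 0.05")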
Results: The answers provided by the LLMs to repeated questions on consecutive days showed a high degree of repeatability. Although there was no statistically significant difference in correct-answer rates among the LLMs (p > 0.05), the models ranked as follows: ChatGPT 3.5 (76.7%) > Copilot P (73.3%) > Copilot F (70%) > ChatGPT 4 (63.3%) > Gemini (46.7%). ChatGPT 3.5, ChatGPT 4, and Gemini showed significantly higher correct response rates on multiple-choice and fill-in-the-blank questions than on dichotomous (true/false) questions (p < 0.05). Conversely, the Copilot models did not exhibit significant differences across question types. Notably, the explanations provided by Copilot and Gemini were often inaccurate, and Copilot's cited references had low evidential value.
Conclusions: While LLMs show potential as adjunct educational tools in dental traumatology, their variable accuracy and inclusion of unreliable references call for careful integration strategies.
Journal introduction:
Dental Traumatology is an international journal that aims to convey scientific and clinical progress in all areas related to adult and pediatric dental traumatology. This includes the following topics:
- Epidemiology, Social Aspects, Education, Diagnostics
- Esthetics / Prosthetics / Restorative
- Evidence Based Traumatology & Study Design
- Oral & Maxillofacial Surgery/Transplant/Implant
- Pediatrics and Orthodontics
- Prevention and Sports Dentistry
- Endodontics and Periodontal Aspects
The journal"s aim is to promote communication among clinicians, educators, researchers, and others interested in the field of dental traumatology.