Assessing the diagnostic and treatment accuracy of Large Language Models (LLMs) in Peri-implant diseases: A clinical experimental study

Igor Amador Barbosa, Mauro Sergio Almeida Alves, Paloma Rayse Zagalo de Almeida, Patricia de Almeida Rodrigues, Roberta Pimentel de Oliveira, Silvio Augusto Fernades de Menezes, João Daniel Mendonça de Moura, Ricardo Roberto de Souza Fonseca

Journal of Dentistry, volume 162, Article 106091. Published 2025-09-04. DOI: 10.1016/j.jdent.2025.106091
Citations: 0
Abstract
Objective
This study evaluated the coherence, consistency, and diagnostic accuracy of eight AI-based chatbots in clinical scenarios related to dental implants.
Methods
A double-blind, clinical experimental study was carried out between February and March 2025 to evaluate eight AI-based chatbots using six fictional cases simulating peri‑implant mucositis and peri‑implantitis. Each chatbot answered five standardized clinical questions across three independent runs per case, generating 720 binary outputs. Blinded investigators scored each response against a gold standard. Statistical analyses included chi-square and Fisher’s exact tests, and Cohen’s Kappa was used to assess intra-model consistency, stability and reliability for each AI chatbot.
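As a minimal sketch (not the authors' code) of how the intra-model stability statistic could be computed: each chatbot produces 6 cases × 5 questions = 30 binary correctness scores per run, and Cohen's kappa can be averaged over the three pairs of runs. All data below are hypothetical illustrations.

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length binary label sequences."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p_a1 = sum(a) / n                                # rate of 1s in run A
    p_b1 = sum(b) / n                                # rate of 1s in run B
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)      # chance agreement
    if p_e == 1.0:
        return 1.0                                   # degenerate case: full agreement expected
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(runs):
    """Average Cohen's kappa over all pairs of runs (3 runs -> 3 pairs)."""
    pairs = list(combinations(runs, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical chatbot: 6 cases x 5 questions = 30 binary scores per run.
run1 = [1, 1, 0, 1, 1] * 6
run2 = [1, 1, 0, 1, 0] * 6
run3 = [1, 1, 0, 1, 1] * 6
print(round(mean_pairwise_kappa([run1, run2, run3]), 2))  # → 0.7
```

With 8 chatbots × 3 runs × 30 scores this yields the 720 binary outputs described above; a per-chatbot kappa near 0.8 would correspond to the high stability reported for GPT-4o.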
Results
GPT-4o demonstrated the highest diagnostic accuracy (88.8 %), followed by Gemini (77.7 %), OpenAI o3-mini (72.2 %), OpenAI o3-mini-high (71.1 %), Claude (66.6 %), OpenAI o1 (60 %), DeepSeek (55.5 %), and Copilot (49.9 %). GPT-4o also showed the highest intra-model stability (κ = 0.82) and consistency, while Copilot and DeepSeek showed the lowest reliability. Significant differences were observed only in the reference citation criterion (p < 0.001); Gemini was the only AI chatbot to achieve 100 % compliance on this criterion, although GPT-4o consistently outperformed the other AI chatbots across all evaluation domains.
Conclusion
GPT-4o demonstrated superior diagnostic accuracy and response consistency, reinforcing the influence of AI chatbot architecture and training on clinical reasoning performance. In contrast, Copilot showed lower reliability and higher variability, emphasizing the need for cautious, evidence-based adoption of AI tools in the diagnosis of peri‑implant diseases.
Clinical relevance
Understanding AI chatbot performance in peri‑implant diagnosis supports evidence-based decision-making and the responsible clinical use of these tools.
About the journal:
The Journal of Dentistry has an open access mirror journal The Journal of Dentistry: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review.
The Journal of Dentistry is the leading international dental journal within the field of Restorative Dentistry. Placing an emphasis on publishing novel and high-quality research papers, the Journal aims to influence the practice of dentistry at clinician, research, industry and policy-maker level on an international basis.
Topics covered include the management of dental disease, periodontology, endodontology, operative dentistry, fixed and removable prosthodontics, dental biomaterials science, long-term clinical trials including epidemiology and oral health, technology transfer of new scientific instrumentation or procedures, as well as clinically relevant oral biology and translational research.
The Journal of Dentistry will publish original scientific research papers including short communications. It is also interested in publishing review articles and leaders in themed areas which will be linked to new scientific research. Conference proceedings are also welcome and expressions of interest should be communicated to the Editor.