Assessing the diagnostic and treatment accuracy of Large Language Models (LLMs) in Peri-implant diseases: A clinical experimental study

Impact Factor 5.5 · CAS Region 2 (Medicine) · JCR Q1 · DENTISTRY, ORAL SURGERY & MEDICINE
Igor Amador Barbosa, Mauro Sergio Almeida Alves, Paloma Rayse Zagalo de Almeida, Patricia de Almeida Rodrigues, Roberta Pimentel de Oliveira, Silvio Augusto Fernades de Menezes, João Daniel Mendonça de Moura, Ricardo Roberto de Souza Fonseca
{"title":"Assessing the diagnostic and treatment accuracy of Large Language Models (LLMs) in Peri-implant diseases: A clinical experimental study","authors":"Igor Amador Barbosa,&nbsp;Mauro Sergio Almeida Alves,&nbsp;Paloma Rayse Zagalo de Almeida,&nbsp;Patricia de Almeida Rodrigues,&nbsp;Roberta Pimentel de Oliveira,&nbsp;Silvio Augusto Fernades de Menezes,&nbsp;João Daniel Mendonça de Moura,&nbsp;Ricardo Roberto de Souza Fonseca","doi":"10.1016/j.jdent.2025.106091","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><div>This study evaluated the coherence, consistency, and diagnostic accuracy of eight AI-based chatbots in clinical scenarios related to dental implants.</div></div><div><h3>Methods</h3><div>A double-blind, clinical experimental study was carried out between February and March 2025, to evaluate eight AI-based chatbots using six fictional cases simulating peri‑implant mucositis and peri‑implantitis. Each chatbot answered five standardized clinical questions across three independent runs per case, generating 720 binary outputs. Blinded investigators scored each response against a gold standard. Statistical analyses included chi-square and Fisher’s exact and Cohen’s Kappa tests were used to assess intra-model consistency, stability and reliability for each AI chatbot.</div></div><div><h3>Results</h3><div>GPT-4o demonstrated the highest diagnostic accuracy (88.8 %), followed by Gemini (77.7 %), OpenAI o3-mini (72.2 %), OpenAI o3-mini-high (71.1 %), Claude (66.6 %), OpenAI o1 (60 %), DeepSeek (55.5 %), and Copilot (49.9 %). GPT-4o also showed the highest intra-model stability (κ = 0.82) and consistency, while Copilot and DeepSeek showed the lowest reliability. Significant differences were observed only in the reference citation criterion (p &lt; 0.001), with Gemini being the only AI chatbot to achieve 100 % compliance, but GPT-4o consistently outperformed the other AI chatbots across all evaluation domains.</div></div><div><h3>Conclusion</h3><div>GPT-4o demonstrated superior diagnostic accuracy and response consistency, reinforcing the influence of AI chatbot architecture and training on clinical reasoning performance. In contrast, Copilot showed lower reliability and higher variability, emphasizing the need for cautious, evidence-based adoption of AI tools in the diagnosis of peri‑implant diseases.</div></div><div><h3>Clinical relevance</h3><div>Understanding AI performance in peri‑implant diagnosis to support evidence-based decision-making using AI and its responsible clinical use.</div></div>","PeriodicalId":15585,"journal":{"name":"Journal of dentistry","volume":"162 ","pages":"Article 106091"},"PeriodicalIF":5.5000,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of dentistry","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0300571225005378","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract

Objective

This study evaluated the coherence, consistency, and diagnostic accuracy of eight AI-based chatbots in clinical scenarios related to dental implants.

Methods

A double-blind, clinical experimental study was carried out between February and March 2025 to evaluate eight AI-based chatbots using six fictional cases simulating peri‑implant mucositis and peri‑implantitis. Each chatbot answered five standardized clinical questions across three independent runs per case, generating 720 binary outputs in total. Blinded investigators scored each response against a gold standard. Statistical analyses included chi-square and Fisher's exact tests; Cohen's Kappa was used to assess intra-model consistency, stability, and reliability for each AI chatbot.
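As a rough sketch of the design described above (the run scores below are hypothetical placeholders, not the authors' data, and the pairwise-kappa scheme is an assumption, since the abstract does not specify how agreement across the three runs was computed), the 720-output count and an intra-model agreement check could look like this:

```python
# Minimal sketch of the study's output structure and an intra-model
# agreement check. All response scores are illustrative placeholders.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

N_CHATBOTS, N_CASES, N_QUESTIONS, N_RUNS = 8, 6, 5, 3
assert N_CHATBOTS * N_CASES * N_QUESTIONS * N_RUNS == 720  # reported total

# Hypothetical binary scores (1 = agrees with the gold standard) for one
# chatbot: 6 cases x 5 questions = 30 scores per run, 3 runs.
runs = [
    [1, 1, 0, 1, 1] * 6,  # run 1
    [1, 1, 0, 1, 0] * 6,  # run 2
    [1, 1, 0, 1, 1] * 6,  # run 3
]

# One plausible stability summary: mean Cohen's kappa over all run pairs.
kappas = [cohen_kappa_score(a, b) for a, b in combinations(runs, 2)]
print(f"mean pairwise kappa: {sum(kappas) / len(kappas):.2f}")
```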

Results

GPT-4o demonstrated the highest diagnostic accuracy (88.8 %), followed by Gemini (77.7 %), OpenAI o3-mini (72.2 %), OpenAI o3-mini-high (71.1 %), Claude (66.6 %), OpenAI o1 (60 %), DeepSeek (55.5 %), and Copilot (49.9 %). GPT-4o also showed the highest intra-model stability (κ = 0.82) and consistency, while Copilot and DeepSeek showed the lowest reliability. Significant differences were observed only in the reference citation criterion (p < 0.001), with Gemini being the only AI chatbot to achieve 100 % compliance, but GPT-4o consistently outperformed the other AI chatbots across all evaluation domains.
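To illustrate how a between-model accuracy comparison of this kind can be set up (not to reproduce the paper's per-criterion analyses), the sketch below runs Fisher's exact test on a 2 × 2 table of correct/incorrect counts. The counts are assumptions back-calculated from the reported percentages, taking 90 binary outputs per chatbot (6 cases × 5 questions × 3 runs):

```python
# Hedged sketch: Fisher's exact test on assumed correct/incorrect counts
# (~88.8 % of 90 for GPT-4o, ~49.9 % of 90 for Copilot); these are
# back-calculated illustrations, not the authors' raw data.
from scipy.stats import fisher_exact

gpt4o_correct, copilot_correct, n = 80, 45, 90
table = [
    [gpt4o_correct, n - gpt4o_correct],      # GPT-4o: correct, incorrect
    [copilot_correct, n - copilot_correct],  # Copilot: correct, incorrect
]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.2g}")
```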

Conclusion

GPT-4o demonstrated superior diagnostic accuracy and response consistency, reinforcing the influence of AI chatbot architecture and training on clinical reasoning performance. In contrast, Copilot showed lower reliability and higher variability, emphasizing the need for cautious, evidence-based adoption of AI tools in the diagnosis of peri‑implant diseases.

Clinical relevance

Understanding AI performance in peri‑implant diagnosis supports evidence-based decision-making and the responsible clinical use of AI tools.
Source journal

Journal of Dentistry (Medicine – Dentistry & Oral Surgery)
CiteScore: 7.30
Self-citation rate: 11.40%
Annual articles: 349
Review time: 35 days
About the journal: The Journal of Dentistry has an open access mirror journal, The Journal of Dentistry: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review.

The Journal of Dentistry is the leading international dental journal within the field of Restorative Dentistry. Placing an emphasis on publishing novel and high-quality research papers, the Journal aims to influence the practice of dentistry at clinician, research, industry and policy-maker level on an international basis. Topics covered include the management of dental disease, periodontology, endodontology, operative dentistry, fixed and removable prosthodontics, dental biomaterials science, long-term clinical trials including epidemiology and oral health, technology transfer of new scientific instrumentation or procedures, as well as clinically relevant oral biology and translational research.

The Journal of Dentistry will publish original scientific research papers including short communications. It is also interested in publishing review articles and leaders in themed areas which will be linked to new scientific research. Conference proceedings are also welcome, and expressions of interest should be communicated to the Editor.