Are answers obtained from artificial intelligence models for information purposes repeatable?
Authors: Yasemin Tunca, Volkan Kaplan, Murat Tunca
Journal: International Orthodontics, Volume 24, Issue 1, Article 101071 (Journal Article)
Publication date: 2025-10-03
DOI: 10.1016/j.ortho.2025.101071
URL: https://www.sciencedirect.com/science/article/pii/S1761722725001068
Impact factor: 1.9 (JCR Q2, Dentistry, Oral Surgery & Medicine)
Citations: 0
Abstract
Introduction
The objective of this study was to assess the repeatability of orthodontic responses generated by multiple large language models across repeated time points.
Methods
This experimental study assessed the answers provided by ChatGPT-3.5, ChatGPT-4.0, Gemini, and Gemini-Advanced to 40 frequently asked orthodontic questions. Each model was prompted with the same questions at three time points (T0: day 0, T1: day 7, and T2: day 14). Two blinded orthodontic experts independently evaluated the responses using a 3-point accuracy scale. Cohen's Kappa was applied to assess inter-rater agreement, and the intraclass correlation coefficient (ICC) was applied to assess repeatability. In addition, the Friedman test with Bonferroni post-hoc analysis and Spearman correlation were used for temporal comparisons.
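As an illustration of the inter-rater agreement statistic named above, the following is a minimal sketch of unweighted Cohen's Kappa in plain Python. The abstract does not say whether a weighted or unweighted variant was used, nor how the 3-point scale was coded, so the unweighted formula, the 0/1/2 coding, the function name `cohen_kappa`, and the example ratings are all assumptions made for illustration and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical scores (NOT from the study), assumed coding:
# 0 = incorrect, 1 = partially correct, 2 = correct.
rater_1 = [2, 2, 1, 0, 2, 1, 2, 0, 1, 2]
rater_2 = [2, 2, 1, 1, 2, 1, 2, 0, 2, 2]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.661"
```

Here the raters agree on 8 of 10 items (observed agreement 0.80) while the marginal frequencies imply a chance agreement of 0.41, giving (0.80 - 0.41) / (1 - 0.41) ≈ 0.661.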
Results
Cohen's Kappa values between raters ranged from 0.624 to 0.749, indicating substantial inter-rater agreement. ICC values for repeatability ranged from 0.666 (Gemini) to 0.960 (ChatGPT-3.5). The Friedman test revealed significant differences in accuracy among the models at T0 and T2 (P < 0.001). Post-hoc analysis showed that ChatGPT-3.5 differed significantly from Gemini and Gemini-Advanced. Spearman correlations between time points were positive but weak (ρ = 0.284 to 0.383, P < 0.001).
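To make the Spearman correlation between time points concrete, the sketch below computes ρ as the Pearson correlation of tie-averaged ranks, which is the usual definition when ties are present (and ties are unavoidable on a 3-point scale). The function names `average_ranks` and `spearman_rho` and the T0/T2 scores are hypothetical, chosen for illustration only. The sketch does not reproduce the study's data or its P values.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical accuracy scores (NOT from the study) for the same
# eight questions at two time points, assumed coding 0/1/2.
scores_t0 = [2, 1, 2, 0, 2, 1, 2, 1]
scores_t2 = [2, 2, 1, 0, 2, 1, 1, 2]
print(f"rho = {spearman_rho(scores_t0, scores_t2):.3f}")
```

A value near 1 would mean questions keep the same relative accuracy ranking across time points, while values near 0.3, as reported in the Results, indicate that the ranking shifts substantially between sessions.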
Conclusions
The study revealed statistically significant differences in repeatability among AI models. Despite high accuracy, some models exhibited limited consistency over time. These findings underscore the importance of evaluating both accuracy and temporal stability when integrating AI systems into clinical orthodontic communication.
About the journal:
A reference journal in the field of orthodontics and related disciplines. Your reference in dentofacial orthopedics. International Orthodontics is addressed to orthodontists, dentists, stomatologists, maxillofacial surgeons and facial plastic surgeons, as well as their assistants.