Are answers obtained from artificial intelligence models for information purposes repeatable?

Impact factor: 1.9; JCR Q2, Dentistry, Oral Surgery & Medicine

International Orthodontics. Pub Date: 2025-10-03. DOI: 10.1016/j.ortho.2025.101071

Yasemin Tunca, Volkan Kaplan, Murat Tunca

Volume 24(1), Article 101071. Publication type: Journal Article. Not open access. https://www.sciencedirect.com/science/article/pii/S1761722725001068

Citations: 0

Abstract

Introduction

The objective of this study was to assess the repeatability of orthodontic responses generated by multiple large language models across repeated time points.

Methods

This experimental study assessed the answers provided by ChatGPT-3.5, ChatGPT-4.0, Gemini, and Gemini Advanced to 40 frequently asked orthodontic questions. Each model was prompted with the same questions at three time points (T0: day 0, T1: day 7, and T2: day 14). Two blinded orthodontic experts independently evaluated the responses using a 3-point accuracy scale. Cohen's kappa and the intraclass correlation coefficient (ICC) were applied to assess inter-rater agreement and repeatability, respectively. In addition, the Friedman test with Bonferroni post-hoc analysis and Spearman correlation were used for temporal comparisons.
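
As an illustration of the inter-rater agreement step, the following is a minimal sketch of unweighted Cohen's kappa for two raters using a 3-point scale. The abstract does not state whether a weighted or unweighted kappa was used, so the unweighted form is an assumption here; the function name `cohen_kappa`, the scale labels, and the scores are all hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items that received the same score from both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one and the same category for every item
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical scores (assumed labels: 1 = inaccurate, 2 = partially accurate, 3 = accurate).
rater_1 = [3, 3, 2, 3, 1, 2, 3, 3, 2, 3]
rater_2 = [3, 3, 2, 2, 1, 2, 3, 3, 3, 3]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters on a short scale where most answers may fall into a single category.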

Results

Cohen's Kappa values between raters ranged from 0.624 to 0.749, indicating substantial inter-rater agreement. ICC values for repeatability ranged from 0.666 (Gemini) to 0.960 (ChatGPT-3.5). Friedman test results revealed significant differences in model accuracy at T0 and T2 (P < 0.001). Post-hoc analysis showed ChatGPT-3.5 differed significantly from Gemini and Gemini Advanced. Spearman correlations between time points were positive but weak (ρ = 0.284 to 0.383, P < 0.001).
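
For readers unfamiliar with the temporal comparison, the following is a minimal sketch of a Spearman correlation between scores at two time points, computed as the Pearson correlation of tie-averaged ranks. Tie handling matters on a 3-point scale because most scores repeat. The helper names `average_ranks` and `spearman_rho` and the example scores are hypothetical; this is not the authors' analysis code, and it does not compute the P value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long series with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical accuracy scores for the same questions at T0 and T2 (not study data).
scores_t0 = [3, 3, 2, 3, 1, 2, 3, 3]
scores_t2 = [3, 2, 2, 3, 2, 3, 3, 1]
print(f"rho = {spearman_rho(scores_t0, scores_t2):.3f}")
```

A low positive value of rho, as reported above, means that questions scored highly at one time point were only loosely likely to score highly at another.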

Conclusions

The study revealed statistically significant differences in repeatability among AI models. Despite high accuracy, some models exhibited limited consistency over time. These findings underscore the importance of evaluating both accuracy and temporal stability when integrating AI systems into clinical orthodontic communication.


Source journal

International Orthodontics (Dentistry, Oral Surgery & Medicine)

CiteScore: 2.50

Self-citation rate: 13.30%

Articles published: (no value given)

Review time: 26 days

Journal description: A reference journal in the field of orthodontics and related disciplines. Your reference in dentofacial orthopedics. International Orthodontics is addressed to orthodontists, dentists, stomatologists, maxillofacial surgeons and facial plastic surgeons, as well as their assistants.