Evaluating the accuracy of generative artificial intelligence models in dental age estimation based on the Demirjian's method.

Impact factor: 1.8 (JCR Q3, Dentistry, Oral Surgery & Medicine)

Frontiers in Dental Medicine. Publication date: 2025-07-29. eCollection date: 2025-01-01. DOI: 10.3389/fdmed.2025.1634006

Allan Abuabara, Thais Vilalba Paniagua Machado do Nascimento, Seandra Maria Trentini, Angela Mairane Costa Gonçalves, Maria Angélica Hueb de Menezes-Oliveira, Isabela Ribeiro Madalena, Svenja Beisel-Memmert, Christian Kirschneck, Livia Azeredo Alves Antunes, Cristiano Miranda de Araujo, Flares Baratto-Filho, Erika Calvano Küchler

Frontiers in Dental Medicine, volume 6, article 1634006. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12339434/pdf/


Abstract

Introduction: Dental age estimation plays a key role in forensic identification, clinical diagnosis, treatment planning, and prognosis in fields such as pediatric dentistry and orthodontics. Large language models (LLMs) are increasingly recognized for their potential applications in dentistry. This study aimed to compare the performance of currently available generative artificial intelligence LLMs in estimating dental age from Demirjian's scores.

Methods: Panoramic radiographs were analyzed using Demirjian's method (1973), with each left permanent mandibular tooth classified from stage A to H. Three LLMs with no task-specific training, ChatGPT (GPT-4-turbo), Gemini 2.0 Flash, and DeepSeek-V3, were tasked with estimating dental age based on the patient's Demirjian score for each tooth. Because ChatGPT, Gemini, and DeepSeek are probabilistic and can produce varying responses to the same question, three responses per case were collected from each model per day (using three different computers), on three separate days. The age estimates obtained from the LLMs were compared with the individuals' chronological ages. Intra- and inter-examiner reliability was assessed using the Intraclass Correlation Coefficient (ICC). Model performance was evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Coefficient of Determination (R²), and Bias.
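The four performance metrics named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' analysis code. The function name `evaluate_age_estimates`, the sign convention for bias (estimated minus chronological age, so positive values mean overestimation), and the toy ages are assumptions made for the example.

```python
import math


def evaluate_age_estimates(chronological, estimated):
    """Return MAE, RMSE, R^2 and bias for paired age values (in years).

    Assumption: bias is the mean of (estimated - chronological), so a
    positive bias means the estimates run older than the true ages.
    """
    if len(chronological) != len(estimated) or not chronological:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(chronological)
    errors = [est - chron for chron, est in zip(chronological, estimated)]

    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)                 # root mean squared error
    bias = sum(errors) / n                       # mean signed error

    mean_chron = sum(chronological) / n
    ss_tot = sum((c - mean_chron) ** 2 for c in chronological)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all chronological ages are equal")
    # R^2 compares the estimates against always guessing the mean age.
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2, "bias": bias}


# Toy data (hypothetical, not from the study): every estimate is 2 years too high.
toy_chronological = [8.0, 10.0, 12.0]
toy_estimated = [10.0, 12.0, 14.0]
toy_metrics = evaluate_age_estimates(toy_chronological, toy_estimated)
print(toy_metrics)
```

In the toy case a constant 2-year overestimate gives MAE, RMSE, and bias of 2.0 and an R² of -0.5, which shows how a systematic offset alone can push R² below zero even when the estimates track the true ages perfectly in rank order.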

Results: Thirty panoramic radiographs (40% female, 60% male; mean age 10.4 ± 2.32 years) were included. Both intra- and inter-examiner ICC values exceeded 0.85. ChatGPT and DeepSeek exhibited comparable but suboptimal performance, with higher errors than Gemini (MAE: 1.98-2.05 years; RMSE: 2.33-2.35 years), negative R² values (-0.069 to -0.049), and substantial overestimation biases (1.90-1.91 years), indicating poor model fit and systematic error. Gemini demonstrated intermediate results, with a moderate MAE (1.57 years) and RMSE (1.81 years), a positive R² (0.367), and a lower bias (1.32 years).
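The sign of the reported R² values can be read off the other reported numbers. The relation below is a sketch under two assumptions: that R² follows the standard definition, and that all metrics were computed over the same set of cases. The abstract does not state how the repeated responses were aggregated. Here y_i is the chronological age, ŷ_i the LLM estimate, ȳ the mean chronological age, and σ_y the population standard deviation of chronological age.

```latex
R^2 \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    \;=\; 1 - \frac{\mathrm{RMSE}^2}{\sigma_y^2},
\qquad
\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2
\quad\Longrightarrow\quad
R^2 < 0 \iff \mathrm{RMSE} > \sigma_y .
```

With a reported age standard deviation of 2.32 years (about 2.28 years if that figure is a sample standard deviation over 30 cases), the RMSE values of 2.33-2.35 years for ChatGPT and DeepSeek exceed σ_y, which is consistent with their negative R², while Gemini's RMSE of 1.81 years falls below it, which is consistent with its positive R². A negative R² here means the estimates were less accurate than assigning every patient the sample's mean age.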

Discussion: This study demonstrated that, although LLMs such as ChatGPT, Gemini, and DeepSeek can estimate dental age from Demirjian's scores, their performance remains inferior to the traditional method. Among them, Gemini 2.0 Flash showed the best results, but all models require task-specific training and validation before clinical application.
