Diagnostic accuracy of generative large language artificial intelligence models for the assessment of dental crowding.

IF 3.1 | Medicine (Tier 2) | Q1 DENTISTRY, ORAL SURGERY & MEDICINE
Khaled Wafaie, Mohamed E Basyouni, Tanmoy Bhattacharjee, Sabarinath Prasad, Baraa Daraqel, Hisham Mohammed
{"title":"生成大语言人工智能模型在牙齿拥挤评估中的诊断准确性。","authors":"Khaled Wafaie, Mohamed E Basyouni, Tanmoy Bhattacharjee, Sabarinath Prasad, Baraa Daraqel, Hisham Mohammed","doi":"10.1186/s12903-025-06960-w","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Generative artificial intelligence (AI) models have shown potential for addressing text-based dental enquiries and answering exam questions. However, their role in diagnosis and treatment planning has not been thoroughly investigated. This study aimed to investigate the reliability of different generative AI models in classifying the severity of dental crowding.</p><p><strong>Methods: </strong>Two experienced orthodontists categorized the severity of dental crowding in 120 intraoral occlusal images as mild, moderate, or severe (40 images per category). These images were then uploaded to three generative AI models (ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet) and prompted to identify the dental arch and to assess the severity of dental crowding. Response times were recorded, and outputs were compared to orthodontists' assessments. A random image subset was re-analyzed after one week to evaluate model consistency.</p><p><strong>Results: </strong>Claude 3.5 Sonnet successfully classified the severity of dental crowding in 50% of the images, followed by ChatGPT-4o mini (44%), and Copilot (34%). Visual recognition of the dental arches was higher with Claude and ChatGPT-4o mini (99%) compared to Copilot (72%). Response generation was significantly longer for ChatGPT-4o mini than for Claude and Copilot (p < .0001). However, the response times were comparable for both Claude and Copilot (p = .98). Repeated analyses showed improvement in image classification for both ChatGPT-4o mini and Copilot, while Claude 3.5 Sonnet misclassified a significant portion of the images.</p><p><strong>Conclusions: </strong>The performance of ChatGPT-4o mini-, Microsoft Copilot, and Claude 3.5 Sonnet in analyzing the severity of dental crowding often did not match the evaluations made by orthodontists. Further developments in image processing algorithms of commercially available generative AI models are required prior to reliable use for dental crowding classification.</p>","PeriodicalId":9072,"journal":{"name":"BMC Oral Health","volume":"25 1","pages":"1558"},"PeriodicalIF":3.1000,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12505568/pdf/","citationCount":"0","resultStr":"{\"title\":\"Diagnostic accuracy of generative large language artificial intelligence models for the assessment of dental crowding.\",\"authors\":\"Khaled Wafaie, Mohamed E Basyouni, Tanmoy Bhattacharjee, Sabarinath Prasad, Baraa Daraqel, Hisham Mohammed\",\"doi\":\"10.1186/s12903-025-06960-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Generative artificial intelligence (AI) models have shown potential for addressing text-based dental enquiries and answering exam questions. However, their role in diagnosis and treatment planning has not been thoroughly investigated. This study aimed to investigate the reliability of different generative AI models in classifying the severity of dental crowding.</p><p><strong>Methods: </strong>Two experienced orthodontists categorized the severity of dental crowding in 120 intraoral occlusal images as mild, moderate, or severe (40 images per category). 
These images were then uploaded to three generative AI models (ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet) and prompted to identify the dental arch and to assess the severity of dental crowding. Response times were recorded, and outputs were compared to orthodontists' assessments. A random image subset was re-analyzed after one week to evaluate model consistency.</p><p><strong>Results: </strong>Claude 3.5 Sonnet successfully classified the severity of dental crowding in 50% of the images, followed by ChatGPT-4o mini (44%), and Copilot (34%). Visual recognition of the dental arches was higher with Claude and ChatGPT-4o mini (99%) compared to Copilot (72%). Response generation was significantly longer for ChatGPT-4o mini than for Claude and Copilot (p < .0001). However, the response times were comparable for both Claude and Copilot (p = .98). Repeated analyses showed improvement in image classification for both ChatGPT-4o mini and Copilot, while Claude 3.5 Sonnet misclassified a significant portion of the images.</p><p><strong>Conclusions: </strong>The performance of ChatGPT-4o mini-, Microsoft Copilot, and Claude 3.5 Sonnet in analyzing the severity of dental crowding often did not match the evaluations made by orthodontists. Further developments in image processing algorithms of commercially available generative AI models are required prior to reliable use for dental crowding classification.</p>\",\"PeriodicalId\":9072,\"journal\":{\"name\":\"BMC Oral Health\",\"volume\":\"25 1\",\"pages\":\"1558\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-10-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12505568/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Oral Health\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12903-025-06960-w\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Oral Health","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12903-025-06960-w","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract

Background: Generative artificial intelligence (AI) models have shown potential for addressing text-based dental enquiries and answering exam questions. However, their role in diagnosis and treatment planning has not been thoroughly investigated. This study aimed to investigate the reliability of different generative AI models in classifying the severity of dental crowding.

Methods: Two experienced orthodontists categorized the severity of dental crowding in 120 intraoral occlusal images as mild, moderate, or severe (40 images per category). These images were then uploaded to three generative AI models (ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet), and each model was prompted to identify the dental arch and to assess the severity of dental crowding. Response times were recorded, and model outputs were compared with the orthodontists' assessments. A random subset of images was re-analyzed after one week to evaluate model consistency.
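
The abstract does not state how the images were submitted; the models may well have been used through their consumer chat interfaces. Purely as an illustration of the workflow described above, the sketch below scripts the same steps (send an occlusal image with a prompt, record the response time, collect the reply) against a vision-capable model API. It assumes the OpenAI Python SDK and the gpt-4o-mini model; the prompt wording, directory name, and output format are hypothetical and not taken from the study.

```python
# Illustrative sketch only: the study's actual submission method and prompts
# are not described in this abstract. Assumes the OpenAI Python SDK
# (pip install openai) with OPENAI_API_KEY set in the environment.
import base64
import time
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# Hypothetical prompt mirroring the two tasks described in the Methods:
# arch identification and crowding-severity classification.
PROMPT = (
    "This is an intraoral occlusal photograph. "
    "1) State whether it shows the upper or lower dental arch. "
    "2) Classify the dental crowding as mild, moderate, or severe."
)

def assess_image(image_path: Path) -> tuple[str, float]:
    """Send one occlusal image with the prompt; return (model reply, seconds)."""
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for the ChatGPT-4o mini interface
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    elapsed = time.perf_counter() - start
    return response.choices[0].message.content, elapsed

if __name__ == "__main__":
    # Hypothetical layout: one folder of occlusal photographs.
    for path in sorted(Path("occlusal_images").glob("*.jpg")):
        reply, seconds = assess_image(path)
        print(f"{path.name}\t{seconds:.1f}s\t{reply!r}")
```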

Results: Claude 3.5 Sonnet correctly classified the severity of dental crowding in 50% of the images, followed by ChatGPT-4o mini (44%) and Copilot (34%). Visual recognition of the dental arches was higher with Claude and ChatGPT-4o mini (99%) than with Copilot (72%). Response generation took significantly longer for ChatGPT-4o mini than for Claude and Copilot (p < .0001), whereas response times for Claude and Copilot were comparable (p = .98). On repeated analysis, image classification improved for both ChatGPT-4o mini and Copilot, while Claude 3.5 Sonnet misclassified a substantial portion of the images.
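
For readers who want to reproduce the bookkeeping behind agreement figures such as the 50%/44%/34% rates above, the short sketch below tabulates a model's outputs against the orthodontists' reference labels. The data are invented for illustration, and the abstract does not name the statistical tests used for the response-time comparisons, so none is shown here.

```python
# Hypothetical data: how per-model agreement with the orthodontists' labels
# could be tabulated. Not the study's data or code.
from collections import Counter

reference = ["mild", "moderate", "severe", "moderate", "severe"]  # orthodontists
model_out = ["mild", "severe",   "severe", "moderate", "mild"]    # one AI model

def agreement(pred: list[str], truth: list[str]) -> float:
    """Percentage of images where the model's label matched the reference."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return 100.0 * correct / len(truth)

print(f"agreement: {agreement(model_out, reference):.0f}%")  # 60% for this toy data
print(Counter(zip(reference, model_out)))  # crude (reference, model) confusion tally
```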

Conclusions: The performance of ChatGPT-4o mini, Microsoft Copilot, and Claude 3.5 Sonnet in analyzing the severity of dental crowding often did not match the evaluations made by orthodontists. Further development of the image processing capabilities of commercially available generative AI models is required before they can be used reliably for dental crowding classification.

Source journal: BMC Oral Health (DENTISTRY, ORAL SURGERY & MEDICINE)
CiteScore: 3.90
Self-citation rate: 6.90%
Articles published: 481
Review time: 6-12 weeks
About the journal: BMC Oral Health is an open access, peer-reviewed journal that considers articles on all aspects of the prevention, diagnosis and management of disorders of the mouth, teeth and gums, as well as related molecular genetics, pathophysiology, and epidemiology.