Performance of a Vision-Language Model in Detecting Common Dental Conditions on Panoramic Radiographs Using Different Tooth Numbering Systems.

IF 3.3 | Zone 3, Medicine | JCR Q1, Medicine, General & Internal
Zekai Liu, Qi Yong H Ai, Andy Wai Kan Yeung, Ray Tanaka, Andrew Nalley, Kuo Feng Hung
Diagnostics, vol. 15, no. 18. Published 2025-09-12. DOI: 10.3390/diagnostics15182315 (https://doi.org/10.3390/diagnostics15182315). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12468776/pdf/

Abstract

Objectives: The aim of this study was to evaluate the performance of GPT-4o in identifying nine common dental conditions on panoramic radiographs, both overall and at specific tooth sites, and to assess whether the tooth numbering system used in the prompt (FDI or Universal) affects its diagnostic accuracy.

Methods: Fifty panoramic radiographs were included, exhibiting various common dental conditions: missing teeth, impacted teeth, caries, endodontically treated teeth, teeth with restorations, periapical lesions, periodontal bone loss, tooth fractures, cracks, retained roots, dental implants, osteolytic lesions, and osteosclerosis. Each image was evaluated twice by GPT-4o in May 2025, once with a structured prompt based on the FDI tooth numbering system and once with a prompt based on the Universal system, to identify the presence of these conditions at specific tooth sites or regions. GPT-4o responses were compared to a consensus reference standard established by an oral-maxillofacial radiology team. Performance was evaluated using balanced accuracy, sensitivity, specificity, and F1 score at both the patient and tooth levels.

Results: A total of 100 GPT-4o responses were generated. At the patient level, balanced accuracy ranged from 46.25% to 98.83% (FDI) and from 49.75% to 92.86% (Universal), with the highest accuracies for dental implants (92.86-98.83%). F1 scores and sensitivities were highest for implants, missing teeth, and impacted teeth, but zero for caries, periapical lesions, and fractures. Specificity was generally high across conditions. Notable discrepancies were observed between patient- and tooth-level performance, especially for implants and restorations. GPT-4o's performance was similar with the two numbering systems.

Conclusions: GPT-4o demonstrated superior performance in detecting dental implants and treated or restored teeth but inferior performance for caries, periapical lesions, and fractures. Diagnostic accuracy was higher at the patient level than at the tooth level, and performance was similar for both numbering systems. Future studies with larger, more diverse datasets and multiple models are needed.
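
The two technical ingredients of the abstract, the mapping between tooth numbering systems and the reported metrics, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `fdi_to_universal` and `metrics`, the `Counts` container, and the example counts are all assumptions made for illustration, and the counts are not data from the study. The sketch covers permanent teeth only and uses the standard definitions of sensitivity, specificity, balanced accuracy, and F1 score.

```python
from dataclasses import dataclass


def fdi_to_universal(fdi: int) -> int:
    """Convert a permanent-tooth FDI code (11-48) to a Universal number (1-32)."""
    quadrant, position = divmod(fdi, 10)
    if quadrant not in (1, 2, 3, 4) or not 1 <= position <= 8:
        raise ValueError(f"not a permanent-tooth FDI code: {fdi}")
    if quadrant == 1:      # upper right: FDI 18..11 -> Universal 1..8
        return 9 - position
    if quadrant == 2:      # upper left:  FDI 21..28 -> Universal 9..16
        return 8 + position
    if quadrant == 3:      # lower left:  FDI 38..31 -> Universal 17..24
        return 25 - position
    return 24 + position   # lower right: FDI 41..48 -> Universal 25..32


@dataclass
class Counts:
    """Confusion-matrix counts for one condition (hypothetical container)."""
    tp: int
    fp: int
    tn: int
    fn: int


def metrics(c: Counts) -> dict[str, float]:
    """Sensitivity, specificity, balanced accuracy, and F1 from raw counts."""
    sensitivity = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    specificity = c.tn / (c.tn + c.fp) if (c.tn + c.fp) else 0.0
    f1_denominator = 2 * c.tp + c.fp + c.fn
    f1 = 2 * c.tp / f1_denominator if f1_denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }


# Illustrative counts only (NOT study data): a model that never flags a
# condition present in 20 of 50 patients.
never_positive = Counts(tp=0, fp=0, tn=30, fn=20)
print(fdi_to_universal(18), fdi_to_universal(48))
print(metrics(never_positive))
```

With the made-up counts above, a model that never reports the condition has a sensitivity and F1 score of zero, a specificity of 1, and a balanced accuracy of 0.5. This is one way a balanced accuracy near 50% can coexist with a zero F1 score and high specificity, the pattern the abstract describes for caries, periapical lesions, and fractures.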

Source journal
Diagnostics (Biochemistry, Genetics and Molecular Biology: Clinical Biochemistry)
CiteScore: 4.70
Self-citation rate: 8.30%
Articles published: 2699
Review time: 19.64 days
Journal description: Diagnostics (ISSN 2075-4418) is an international scholarly open access journal on medical diagnostics. It publishes original research articles, reviews, communications and short notes on the research and development of medical diagnostics. There is no restriction on the length of the papers. Our aim is to encourage scientists to publish their experimental and theoretical research in as much detail as possible. Full experimental and/or methodological details must be provided for research articles.