Accuracy and consistency of ChatGPT-3.5 and -4 in providing differential diagnoses in oral and maxillofacial diseases: a comparative diagnostic performance analysis.

IF 3.1, Zone 2 (Medicine), Q1 DENTISTRY, ORAL SURGERY & MEDICINE

Clinical Oral Investigations Pub Date : 2024-09-24 DOI:10.1007/s00784-024-05939-1

Saygo Tomo, Jérôme R Lechien, Hugo Sobrinho Bueno, Daniela Filié Cantieri-Debortoli, Luciana Estevam Simonato


Citations: 0

Abstract

Objective: To investigate the performance of ChatGPT in the differential diagnosis of oral and maxillofacial diseases.

Methods: Findings from thirty-seven oral and maxillofacial lesions were presented to ChatGPT-3.5 and -4, 18 dental surgeons trained in oral medicine/pathology (OMP), 23 general dental surgeons (DDS), and 16 dental students (DS) for differential diagnosis. Additionally, a group of 15 general dentists was asked to describe 11 cases to both ChatGPT versions. The primary and alternative diagnoses given by ChatGPT-3.5, ChatGPT-4, and the human participants were rated by 2 independent investigators on a 4-point Likert scale. The consistency of ChatGPT-3.5 and -4 was evaluated with regenerated inputs.

Results: ChatGPT-3.5 and -4 showed moderate consistency in providing primary (κ = 0.532 and κ = 0.533, respectively) and alternative (κ = 0.337 and κ = 0.367, respectively) hypotheses. The mean rate of correct diagnoses was 64.86% for ChatGPT-3.5, 80.18% for ChatGPT-4, 86.64% for OMP, 24.32% for DDS, and 16.67% for DS. The mean correct primary hypothesis rates were 45.95% for ChatGPT-3.5, 61.80% for ChatGPT-4, 82.28% for OMP, 22.72% for DDS, and 15.77% for DS. The mean correct diagnosis rate for ChatGPT-3.5 with standard descriptions was 64.86%, compared to 45.95% with participants' descriptions. For ChatGPT-4, the mean was 80.18% with standard descriptions and 61.80% with participants' descriptions.
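
As an illustration of how a κ value like those above can be obtained, the following is a minimal sketch of Cohen's kappa applied to two runs of ratings on the same cases. The abstract does not state which kappa variant the authors used, so Cohen's kappa is an assumption here, and the function name `cohen_kappa` and the rating lists are hypothetical, not data from the study.

```python
from collections import Counter


def cohen_kappa(first_run, second_run):
    """Cohen's kappa for two equally long sequences of categorical ratings."""
    if len(first_run) != len(second_run) or not first_run:
        raise ValueError("both runs must be non-empty and of equal length")
    n = len(first_run)
    # Observed agreement: share of cases rated identically in both runs.
    observed = sum(a == b for a, b in zip(first_run, second_run)) / n
    # Chance agreement: product of the marginal frequencies, summed per category.
    counts_a = Counter(first_run)
    counts_b = Counter(second_run)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 4-point ratings for the same 8 cases in two regenerated runs.
run_1 = [1, 1, 2, 2, 3, 3, 4, 4]
run_2 = [1, 1, 2, 3, 3, 3, 4, 1]
kappa = cohen_kappa(run_1, run_2)
print(f"kappa = {kappa:.3f}")
```

In this made-up example the two runs agree on 6 of 8 cases (0.75) while chance agreement is 0.25, so κ is about 0.667. Kappa corrects raw agreement for chance, which is why values near 0.5, as reported above, are read as only moderate consistency.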

Conclusion: ChatGPT-4 demonstrates accuracy comparable to that of specialists in providing differential diagnoses for oral and maxillofacial diseases. The consistency of ChatGPT in providing diagnostic hypotheses for oral disease cases is moderate, representing a weakness for clinical application. The quality of case documentation and descriptions significantly affects the performance of ChatGPT.

Clinical relevance: General dentists, dental students, and specialists in oral medicine and pathology may benefit from ChatGPT-4 as an auxiliary method for establishing differential diagnoses of oral and maxillofacial lesions, but its accuracy depends on precise case descriptions.




Source journal

Clinical Oral Investigations (Medicine: Dentistry and Oral Surgery)

CiteScore

6.30

Self-citation rate

5.90%

Articles published

484

Review time

3 months

About the journal: The journal Clinical Oral Investigations is a multidisciplinary, international forum for publication of research from all fields of oral medicine. The journal publishes original scientific articles and invited reviews which provide up-to-date results of basic and clinical studies in oral and maxillofacial science and medicine. The aim is to clarify the relevance of new results to modern practice, for an international readership. Coverage includes maxillofacial and oral surgery, prosthetics and restorative dentistry, operative dentistry, endodontics, periodontology, orthodontics, dental materials science, clinical trials, epidemiology, pedodontics, oral implants, preventive dentistry, oral pathology, oral basic sciences and more.