The Diagnostic Performance of Large Language Models and Oral Medicine Consultants for Identifying Oral Lesions in Text-Based Clinical Scenarios: Prospective Comparative Study.

JMIR AI (IF 2.0) · Pub Date: 2025-04-24 · DOI: 10.2196/70566
Sarah AlFarabi Ali, Hebah AlDehlawi, Ahoud Jazzar, Heba Ashi, Nihal Esam Abuzinadah, Mohammad AlOtaibi, Abdulrahman Algarni, Hazzaa Alqahtani, Sara Akeel, Soulafa Almazrooa

Abstract

Background: The use of artificial intelligence (AI), especially large language models (LLMs), is increasing in health care, including in dentistry. There has yet to be an assessment of the diagnostic performance of LLMs in oral medicine.

Objective: We aimed to compare the effectiveness of ChatGPT (OpenAI) and Microsoft Copilot (integrated within the Microsoft 365 suite) with oral medicine consultants in formulating accurate differential and final diagnoses for oral lesions from written clinical scenarios.

Methods: Fifty comprehensive clinical case scenarios, each including patient age, presenting complaint, history of the presenting complaint, medical history, allergies, intra- and extraoral findings, lesion description, and additional information such as laboratory investigations and specific clinical features, were given to three oral medicine consultants, who were asked to formulate a differential diagnosis and a final diagnosis. Specific prompts for the same 50 cases were designed and input into ChatGPT and Copilot to formulate both differential and final diagnoses. Diagnostic accuracy was then compared between the LLMs and the oral medicine consultants.
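The Methods above describe assembling each scenario's clinical fields into a structured prompt for the LLMs. A minimal sketch of how such a prompt might be built is shown below; the field names, the example case, and the task wording are illustrative assumptions, not the study's actual prompts.

```python
# Illustrative only: assemble the clinical fields listed in the Methods
# into a single text prompt. Field names and the example case are
# hypothetical, not taken from the study's 50 scenarios.

def build_prompt(case: dict) -> str:
    """Join the available clinical fields into one diagnostic prompt."""
    sections = [
        ("Patient age", case.get("age")),
        ("Presenting complaint", case.get("complaint")),
        ("History of presenting complaint", case.get("history")),
        ("Medical history", case.get("medical_history")),
        ("Allergies", case.get("allergies")),
        ("Extraoral findings", case.get("extraoral")),
        ("Intraoral findings", case.get("intraoral")),
        ("Lesion description", case.get("lesion")),
        ("Additional information", case.get("additional")),
    ]
    # Omit fields that are absent from this particular case.
    body = "\n".join(f"{label}: {value}" for label, value in sections if value)
    task = ("Based on the case above, list a differential diagnosis "
            "and state the single most likely final diagnosis.")
    return f"{body}\n\n{task}"

# Hypothetical example case.
example_case = {
    "age": "54 years",
    "complaint": "White patch on the left buccal mucosa",
    "lesion": "Non-scrapable white plaque, about 1 x 2 cm, asymptomatic",
}
prompt = build_prompt(example_case)
```

The same prompt string could then be submitted unchanged to each model, keeping the input identical across ChatGPT, Copilot, and the consultants' written scenarios.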

Results: ChatGPT exhibited the highest accuracy, providing the correct differential diagnoses in 37 of 50 cases (74%). There were no significant differences in the accuracy of providing the correct differential diagnoses between AI models and oral medicine consultants. ChatGPT was as accurate as consultants in making the final diagnoses, but Copilot was significantly less accurate than ChatGPT (P=.015) and one of the oral medicine consultants (P<.001) in providing the correct final diagnosis.
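The reported accuracy is per-case arithmetic over the 50 scenarios. A small sketch of that computation follows; only the 37-of-50 figure comes from the Results, and the per-case correctness vector is synthetic, constructed solely to reproduce that count.

```python
# Sketch of the accuracy calculation in the Results. The 37/50 count is
# from the abstract; the boolean vector below is synthetic.

def accuracy(correct_flags):
    """Proportion of cases with a correct diagnosis."""
    return sum(correct_flags) / len(correct_flags)

# Synthetic vector reproducing ChatGPT's reported 37 correct
# differential diagnoses out of 50 cases.
chatgpt_differential = [True] * 37 + [False] * 13
print(f"{accuracy(chatgpt_differential):.0%}")  # 74%
```

Pairwise significance comparisons such as the reported P=.015 between Copilot and ChatGPT would additionally require the per-rater contingency counts, which the abstract does not provide.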

Conclusions: ChatGPT and Copilot show promising performance in diagnosing oral lesions from clinical case scenarios and could assist dental practitioners. ChatGPT-4 and Copilot are still evolving, but even now they may offer a meaningful advantage in the clinical setting as tools to support dental practitioners in their daily practice.

