Chad Chan-Chia Lin, Jui-Sheng Sun, Chin-Hao Chang, Yu-Han Chang, Jenny Zwei-Chieng Chang
Performance of artificial intelligence chatbots in National dental licensing examination
Journal of Dental Sciences, Volume 20, Issue 4, Pages 2307–2314
Published: 2025-05-27 · DOI: 10.1016/j.jds.2025.05.012
https://www.sciencedirect.com/science/article/pii/S1991790225001606
Citation count: 0
Abstract
Background/purpose
The Taiwan dental board exams comprehensively assess dental candidates across twenty distinct subjects, spanning foundational knowledge to clinical fields, using multiple-choice single-answer exams with a minimum passing score of 60 %. This study assesses the performance of artificial intelligence (AI)-powered chatbots (specifically ChatGPT3.5, Gemini, and Claude2), categorized as Large Language Models (LLMs), on these exams from 2021 to 2023.
Materials and methods
A total of 2699 multiple-choice questions spanning eight subjects in basic dentistry and twelve in clinical dentistry were analyzed. Questions involving images and tables were excluded. Statistical analyses were conducted using McNemar's test. Furthermore, annual results of LLMs were compared with the qualification rates of human candidates to provide additional context.
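The pairwise model comparisons described above rely on McNemar's test, which uses only the discordant question pairs (those one model answered correctly and the other missed). The study does not specify its implementation, so the sketch below uses a standard exact (binomial) form with hypothetical discordant counts:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) McNemar's test p-value.

    b: questions model A answered correctly but model B missed
    c: questions model B answered correctly but model A missed
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided exact binomial test of b vs. c under H0: P = 0.5
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical counts for illustration only (not from the study)
p_value = mcnemar_exact(b=120, c=80)
```

A small p-value indicates the two chatbots' per-question accuracies differ beyond what chance discordance would explain, which is the sense in which the Results report "statistically significant differences" between models.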
Results
Claude2 demonstrated the highest overall accuracy (54.89 %) on the Taiwan national dental licensing examinations, outperforming ChatGPT3.5 (49.33 %) and Gemini (44.63 %), with statistically significant differences in performance across models. In the basic dentistry domain, Claude2 scored 59.73 %, followed by ChatGPT3.5 (54.87 %) and Gemini (47.35 %). Notably, Claude2 excelled in biochemistry (73.81 %) and oral microbiology (88.89 %), while ChatGPT3.5 also performed strongly in oral microbiology (80.56 %). In the clinical dentistry domain, Claude2 led with a score of 52.45 %, surpassing ChatGPT3.5 (46.54 %) and Gemini (43.26 %), and showed strong results in dental public health (65.81 %). Despite these achievements, none of the LLMs attained passing scores overall.
Conclusion
None of the models achieved a passing score; their results highlight strengths in foundational knowledge but limitations in clinical reasoning.
About the journal
The Journal of Dental Sciences (JDS), published quarterly, is the official and open access publication of the Association for Dental Sciences of the Republic of China (ADS-ROC). The predecessor of the JDS was the Chinese Dental Journal (CDJ), which was already covered by MEDLINE in 1988. As the CDJ continued to prove its importance in the region, the ADS-ROC decided to reach the international community by publishing an English-language journal; hence the birth of the JDS in 2006. The JDS has been indexed in the SCI Expanded since 2008, and is also indexed in Scopus, EMCare, ScienceDirect, and SIIC Data Bases.
The topics covered by the JDS include all fields of basic and clinical dentistry. Manuscripts focusing on endemic diseases, such as dental caries and periodontal diseases in particular regions of any country, as well as oral pre-cancers, oral cancers, and oral submucous fibrosis related to the betel nut chewing habit, are also considered for publication. The JDS also publishes articles on the efficacy of new treatment modalities for oral verrucous hyperplasia or early oral squamous cell carcinoma.