Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education

IF 3.4 | Zone 3 (Medicine) | JCR Q1: DENTISTRY, ORAL SURGERY & MEDICINE

Journal of Periodontal Research | Pub Date: 2024-07-18 | DOI: 10.1111/jre.13323

Hamoun Sabri, Muhammad H. A. Saleh, Parham Hazrati, Keith Merchant, Jonathan Misch, Purnima S. Kumar, Hom-Lay Wang, Shayan Barootchi

Journal of Periodontal Research, vol. 60(2), pp. 121-133. Publication type: Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1111/jre.13323

Citations: 0

Abstract

Introduction

The rise of novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, this study aimed to quantify the accuracy of responses provided by three primary large language models (LLMs), ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, to the annual in-service examination questions posed by the American Academy of Periodontology (AAP), and to compare that accuracy with the performance of human graduate students (control group).

Methods

Under a comparative cross-sectional study design, a corpus of 1312 questions from the AAP annual in-service examinations administered between 2020 and 2023 was presented to the LLMs. Their responses were analyzed using chi-square tests, and their performance was compared with the scores of periodontal residents from the corresponding years, who served as the human control group. Additionally, two sub-analyses were performed: one on the performance of the LLMs on each section of the exam, and one on their performance in answering the most difficult questions.
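As an illustration of the kind of chi-square comparison described in the Methods, the following is a minimal sketch of a Pearson chi-square test on a 2x2 table of correct versus incorrect answers for two responders. The function name `chi_square_2x2` and the counts in the usage line are hypothetical and are not taken from the study. The abstract does not say whether a continuity correction was applied, so this sketch assumes none.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) on two accuracy rates.

    Returns (chi2, p_value) for the 2x2 table of correct/incorrect counts.
    Assumes both outcomes (correct and incorrect) occur at least once overall.
    """
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for illustration only; NOT data from the study.
chi2, p = chi_square_2x2(correct_a=80, total_a=100, correct_b=60, total_b=100)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

A small p-value here would indicate that the two accuracy rates are unlikely to differ by chance alone. With the actual per-year question counts, the same function could be applied to each pair of responders.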

Results

ChatGPT-4 (total average: 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001), with accuracy ranging from 78.80% to 80.98% across the exam years. Gemini consistently outperformed ChatGPT-3.5, scoring 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) on the exams from 2020 to 2023, compared with 62.5%, 68.24%, 69.83%, and 59.27%, respectively, for ChatGPT-3.5. With all exam years combined, Google Gemini (72.86%) surpassed the average scores achieved by first-year (63.48% ± 31.67) and second-year residents (66.25% ± 31.61). However, it could not surpass that of third-year residents (69.06% ± 30.45).
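The year-by-year gap between Gemini and ChatGPT-3.5 can be read off the reported percentages. The short sketch below computes it in percentage points. The scores are those quoted in the Results, while the helper name `yearly_gaps` is hypothetical.

```python
# Accuracy (%) reported in the Results for the 2020-2023 exams.
YEARS = [2020, 2021, 2022, 2023]
GEMINI = [70.65, 73.29, 75.73, 72.18]
GPT35 = [62.5, 68.24, 69.83, 59.27]


def yearly_gaps(scores_a, scores_b):
    """Return the per-year difference (a - b) in percentage points."""
    return [round(a - b, 2) for a, b in zip(scores_a, scores_b)]


for year, gap in zip(YEARS, yearly_gaps(GEMINI, GPT35)):
    print(f"{year}: Gemini leads ChatGPT-3.5 by {gap:.2f} percentage points")
```

These differences are descriptive only; the significance of each comparison is given by the p-values reported in the abstract.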

Conclusions

Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, while Gemini and ChatGPT-3.5 showed weaker performance. These findings underscore the potential of deploying LLMs as educational tools in periodontics and oral implantology. However, the current limitations of these models should be considered: an inability to effectively process image-based inquiries, a propensity for generating inconsistent responses to the same prompt, and accuracy rates that are high (80% for GPT-4) but not absolute. An objective comparison of their capability versus their capacity is required to further develop this field of study.


Source journal

Journal of Periodontal Research (Medicine: Dentistry and Oral Surgery)

CiteScore

6.90

Self-citation rate

5.70%

Articles published

103

Review time

6-12 weeks

About the journal: The Journal of Periodontal Research is an international research periodical whose purpose is to publish original clinical and basic investigations and review articles concerned with every aspect of periodontology and related sciences. Brief communications (1-3 journal pages) are also accepted, and a special effort is made to ensure their rapid publication. Reports of scientific meetings in periodontology and related fields are also published. One volume of six issues is published annually.