Poorya Jalali DDS; Hossein Mohammad-Rahimi DDS; Feng-Ming Wang DDS, PhD; Fatemeh Sohrabniya DDS; Seyed AmirHossein Ourang DDS; Yuke Tian BS, MS; Frederico C. Martinho DDS, MSc, PhD; Ali Nosrat DDS, MS, MDS
{"title":"7个人工智能聊天机器人在板状牙髓问题上的表现。","authors":"Poorya Jalali DDS , Hossein Mohammad-Rahimi DDS , Feng-Ming Wang DDS, PhD , Fatemeh Sohrabniya DDS , Seyed AmirHossein Ourang DDS , Yuke Tian BS, MS , Frederico C. Martinho DDS, MSc, PhD , Ali Nosrat DDS, MS, MDS","doi":"10.1016/j.joen.2025.06.014","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>The aim of this study was to assess the overall performance of artificial intelligence chatbots in answering board-style endodontic questions.</div></div><div><h3>Methods</h3><div>One hundred multiple choice endodontic questions, following the style of American Board of Endodontics Written Exam, were generated by two board-certified endodontists. The questions were submitted to the following chatbots, three times in a row: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet. The chatbot was asked to choose the correct response and to explain the justification. The response to the question was considered “correct” only if the chatbot picked the right choice in ALL 3 attempts. The quality of reasoning as to why the chatbot selected the answer choice was scored using a three-ordinal scale (0, 1, 2). Two calibrated reviewers scored all 2100 responses independently. Categorical data were analyzed using Chi-square test; ordinal data were analyzed using Kruskal–Wallis and Mann–Whitney tests.</div></div><div><h3>Results</h3><div>The accuracy scores ranged from 48% (Microsoft Copilot) to 71% (Gemini Advanced, GPT-3.5, and Claude 3.5 Sonnet) (<em>P</em> < .05). Gemini Advanced, Gemini, and Microsoft Copilot showed similar performance regardless of the question source (textbook or literature) (<em>P</em> > .05). GPT-3.5, GPT-4o, GPT-4.0 and Claude 3.5 Sonnet performed significantly better with textbook-based questions (<em>P</em> < .05). Reasoning scores showed different distribution among chatbots (<em>P</em> < .05). Gemini Advanced had the highest rate of score 2 (81%) and the lowest rate of score 0 (18.5%).</div></div><div><h3>Conclusions</h3><div>Comprehensive assessment of seven AI chatbots’ performance on board-style endodontic questions revealed their capacities and limitations as educational resources in the field of endodontics.</div></div>","PeriodicalId":15703,"journal":{"name":"Journal of endodontics","volume":"51 10","pages":"Pages 1413-1419"},"PeriodicalIF":3.6000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance of 7 Artificial Intelligence Chatbots on Board-style Endodontic Questions\",\"authors\":\"Poorya Jalali DDS , Hossein Mohammad-Rahimi DDS , Feng-Ming Wang DDS, PhD , Fatemeh Sohrabniya DDS , Seyed AmirHossein Ourang DDS , Yuke Tian BS, MS , Frederico C. Martinho DDS, MSc, PhD , Ali Nosrat DDS, MS, MDS\",\"doi\":\"10.1016/j.joen.2025.06.014\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Introduction</h3><div>The aim of this study was to assess the overall performance of artificial intelligence chatbots in answering board-style endodontic questions.</div></div><div><h3>Methods</h3><div>One hundred multiple choice endodontic questions, following the style of American Board of Endodontics Written Exam, were generated by two board-certified endodontists. The questions were submitted to the following chatbots, three times in a row: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet. 
The chatbot was asked to choose the correct response and to explain the justification. The response to the question was considered “correct” only if the chatbot picked the right choice in ALL 3 attempts. The quality of reasoning as to why the chatbot selected the answer choice was scored using a three-ordinal scale (0, 1, 2). Two calibrated reviewers scored all 2100 responses independently. Categorical data were analyzed using Chi-square test; ordinal data were analyzed using Kruskal–Wallis and Mann–Whitney tests.</div></div><div><h3>Results</h3><div>The accuracy scores ranged from 48% (Microsoft Copilot) to 71% (Gemini Advanced, GPT-3.5, and Claude 3.5 Sonnet) (<em>P</em> < .05). Gemini Advanced, Gemini, and Microsoft Copilot showed similar performance regardless of the question source (textbook or literature) (<em>P</em> > .05). GPT-3.5, GPT-4o, GPT-4.0 and Claude 3.5 Sonnet performed significantly better with textbook-based questions (<em>P</em> < .05). Reasoning scores showed different distribution among chatbots (<em>P</em> < .05). Gemini Advanced had the highest rate of score 2 (81%) and the lowest rate of score 0 (18.5%).</div></div><div><h3>Conclusions</h3><div>Comprehensive assessment of seven AI chatbots’ performance on board-style endodontic questions revealed their capacities and limitations as educational resources in the field of endodontics.</div></div>\",\"PeriodicalId\":15703,\"journal\":{\"name\":\"Journal of endodontics\",\"volume\":\"51 10\",\"pages\":\"Pages 1413-1419\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of endodontics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0099239925003747\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of endodontics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0099239925003747","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Performance of 7 Artificial Intelligence Chatbots on Board-style Endodontic Questions
Introduction
The aim of this study was to assess the overall performance of artificial intelligence chatbots in answering board-style endodontic questions.
Methods
One hundred multiple-choice endodontic questions, following the style of the American Board of Endodontics Written Exam, were generated by two board-certified endodontists. The questions were submitted to the following chatbots, three times in a row: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet. Each chatbot was asked to choose the correct response and to explain its justification. A response was considered "correct" only if the chatbot picked the right choice in all 3 attempts. The quality of the reasoning behind each selected answer was scored on a 3-point ordinal scale (0, 1, 2). Two calibrated reviewers independently scored all 2,100 responses. Categorical data were analyzed using the chi-square test; ordinal data were analyzed using Kruskal-Wallis and Mann-Whitney tests.
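The scoring rule and statistical comparisons described above can be expressed as a short sketch. The snippet below is not the authors' code; it assumes a hypothetical long-format file graded_responses.csv with columns chatbot, question_id, attempt, is_correct, and reasoning_score, and illustrates the "correct in all 3 attempts" accuracy rule followed by the chi-square, Kruskal-Wallis, and pairwise Mann-Whitney analyses.

```python
# Minimal sketch (assumed data layout, not the authors' implementation).
import pandas as pd
from scipy.stats import chi2_contingency, kruskal, mannwhitneyu

# Hypothetical input: one row per chatbot x question x attempt.
responses = pd.read_csv("graded_responses.csv")

# A question counts as "correct" only if all 3 attempts were correct.
per_question = (
    responses.groupby(["chatbot", "question_id"])["is_correct"]
    .all()
    .reset_index(name="correct_all_3")
)
accuracy = per_question.groupby("chatbot")["correct_all_3"].mean() * 100
print(accuracy.round(1))

# Chi-square test on correct/incorrect counts across chatbots.
counts = pd.crosstab(per_question["chatbot"], per_question["correct_all_3"])
chi2, p_chi, dof, expected = chi2_contingency(counts)
print(f"chi-square p = {p_chi:.4f}")

# Kruskal-Wallis on the ordinal reasoning scores (0, 1, 2) across chatbots,
# followed by pairwise Mann-Whitney U tests.
groups = [g["reasoning_score"].values for _, g in responses.groupby("chatbot")]
h_stat, p_kw = kruskal(*groups)
print(f"Kruskal-Wallis p = {p_kw:.4f}")

bots = sorted(responses["chatbot"].unique())
for i in range(len(bots)):
    for j in range(i + 1, len(bots)):
        a = responses.loc[responses["chatbot"] == bots[i], "reasoning_score"]
        b = responses.loc[responses["chatbot"] == bots[j], "reasoning_score"]
        u_stat, p_mw = mannwhitneyu(a, b)
        print(f"{bots[i]} vs {bots[j]}: Mann-Whitney p = {p_mw:.4f}")
```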
Results
The accuracy scores ranged from 48% (Microsoft Copilot) to 71% (Gemini Advanced, GPT-3.5, and Claude 3.5 Sonnet) (P < .05). Gemini Advanced, Gemini, and Microsoft Copilot showed similar performance regardless of the question source (textbook or literature) (P > .05). GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet performed significantly better on textbook-based questions (P < .05). Reasoning scores showed different distributions among the chatbots (P < .05). Gemini Advanced had the highest rate of score 2 (81%) and the lowest rate of score 0 (18.5%).
Conclusions
A comprehensive assessment of the seven AI chatbots' performance on board-style endodontic questions revealed their capacities and limitations as educational resources in the field of endodontics.
About the Journal
The Journal of Endodontics, the official journal of the American Association of Endodontists, publishes scientific articles, case reports and comparison studies evaluating materials and methods of pulp conservation and endodontic treatment. Endodontists and general dentists can learn about new concepts in root canal treatment and the latest advances in techniques and instrumentation in the one journal that helps them keep pace with rapid changes in this field.