Poorya Jalali DDS; Hossein Mohammad-Rahimi DDS; Feng-Ming Wang DDS, PhD; Fatemeh Sohrabniya DDS; Seyed AmirHossein Ourang DDS; Yuke Tian BS, MS; Frederico C. Martinho DDS, MSc, PhD; Ali Nosrat DDS, MS, MDS
{"title":"7个人工智能聊天机器人在板状牙髓问题上的表现。","authors":"Poorya Jalali DDS , Hossein Mohammad-Rahimi DDS , Feng-Ming Wang DDS, PhD , Fatemeh Sohrabniya DDS , Seyed AmirHossein Ourang DDS , Yuke Tian BS, MS , Frederico C. Martinho DDS, MSc, PhD , Ali Nosrat DDS, MS, MDS","doi":"10.1016/j.joen.2025.06.014","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>The aim of this study was to assess the overall performance of artificial intelligence chatbots in answering board-style endodontic questions.</div></div><div><h3>Methods</h3><div>One hundred multiple choice endodontic questions, following the style of American Board of Endodontics Written Exam, were generated by two board-certified endodontists. The questions were submitted to the following chatbots, three times in a row: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet. The chatbot was asked to choose the correct response and to explain the justification. The response to the question was considered “correct” only if the chatbot picked the right choice in ALL 3 attempts. The quality of reasoning as to why the chatbot selected the answer choice was scored using a three-ordinal scale (0, 1, 2). Two calibrated reviewers scored all 2100 responses independently. Categorical data were analyzed using Chi-square test; ordinal data were analyzed using Kruskal–Wallis and Mann–Whitney tests.</div></div><div><h3>Results</h3><div>The accuracy scores ranged from 48% (Microsoft Copilot) to 71% (Gemini Advanced, GPT-3.5, and Claude 3.5 Sonnet) (<em>P</em> < .05). Gemini Advanced, Gemini, and Microsoft Copilot showed similar performance regardless of the question source (textbook or literature) (<em>P</em> > .05). GPT-3.5, GPT-4o, GPT-4.0 and Claude 3.5 Sonnet performed significantly better with textbook-based questions (<em>P</em> < .05). Reasoning scores showed different distribution among chatbots (<em>P</em> < .05). Gemini Advanced had the highest rate of score 2 (81%) and the lowest rate of score 0 (18.5%).</div></div><div><h3>Conclusions</h3><div>Comprehensive assessment of seven AI chatbots’ performance on board-style endodontic questions revealed their capacities and limitations as educational resources in the field of endodontics.</div></div>","PeriodicalId":15703,"journal":{"name":"Journal of endodontics","volume":"51 10","pages":"Pages 1413-1419"},"PeriodicalIF":3.6000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance of 7 Artificial Intelligence Chatbots on Board-style Endodontic Questions\",\"authors\":\"Poorya Jalali DDS , Hossein Mohammad-Rahimi DDS , Feng-Ming Wang DDS, PhD , Fatemeh Sohrabniya DDS , Seyed AmirHossein Ourang DDS , Yuke Tian BS, MS , Frederico C. Martinho DDS, MSc, PhD , Ali Nosrat DDS, MS, MDS\",\"doi\":\"10.1016/j.joen.2025.06.014\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Introduction</h3><div>The aim of this study was to assess the overall performance of artificial intelligence chatbots in answering board-style endodontic questions.</div></div><div><h3>Methods</h3><div>One hundred multiple choice endodontic questions, following the style of American Board of Endodontics Written Exam, were generated by two board-certified endodontists. The questions were submitted to the following chatbots, three times in a row: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet. 
The chatbot was asked to choose the correct response and to explain the justification. The response to the question was considered “correct” only if the chatbot picked the right choice in ALL 3 attempts. The quality of reasoning as to why the chatbot selected the answer choice was scored using a three-ordinal scale (0, 1, 2). Two calibrated reviewers scored all 2100 responses independently. Categorical data were analyzed using Chi-square test; ordinal data were analyzed using Kruskal–Wallis and Mann–Whitney tests.</div></div><div><h3>Results</h3><div>The accuracy scores ranged from 48% (Microsoft Copilot) to 71% (Gemini Advanced, GPT-3.5, and Claude 3.5 Sonnet) (<em>P</em> < .05). Gemini Advanced, Gemini, and Microsoft Copilot showed similar performance regardless of the question source (textbook or literature) (<em>P</em> > .05). GPT-3.5, GPT-4o, GPT-4.0 and Claude 3.5 Sonnet performed significantly better with textbook-based questions (<em>P</em> < .05). Reasoning scores showed different distribution among chatbots (<em>P</em> < .05). Gemini Advanced had the highest rate of score 2 (81%) and the lowest rate of score 0 (18.5%).</div></div><div><h3>Conclusions</h3><div>Comprehensive assessment of seven AI chatbots’ performance on board-style endodontic questions revealed their capacities and limitations as educational resources in the field of endodontics.</div></div>\",\"PeriodicalId\":15703,\"journal\":{\"name\":\"Journal of endodontics\",\"volume\":\"51 10\",\"pages\":\"Pages 1413-1419\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of endodontics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0099239925003747\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of endodontics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0099239925003747","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Performance of 7 Artificial Intelligence Chatbots on Board-style Endodontic Questions
Introduction
The aim of this study was to assess the overall performance of artificial intelligence chatbots in answering board-style endodontic questions.
Methods
One hundred multiple-choice endodontic questions, following the style of the American Board of Endodontics Written Exam, were generated by two board-certified endodontists. The questions were submitted to the following chatbots, three times in a row: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet. Each chatbot was asked to choose the correct response and to explain its justification. A response was considered "correct" only if the chatbot picked the right choice in all 3 attempts. The quality of the reasoning behind each selected answer was scored on a 3-point ordinal scale (0, 1, 2). Two calibrated reviewers independently scored all 2,100 responses. Categorical data were analyzed using the chi-square test; ordinal data were analyzed using Kruskal-Wallis and Mann-Whitney tests.
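The scoring rule and statistical comparisons described above can be expressed as a short sketch. The snippet below is not the authors' code; it assumes a hypothetical long-format file graded_responses.csv with columns chatbot, question_id, attempt, is_correct, and reasoning_score, and illustrates the "correct in all 3 attempts" accuracy rule followed by the chi-square, Kruskal-Wallis, and pairwise Mann-Whitney analyses.

```python
# Minimal sketch (assumed data layout, not the authors' implementation).
import pandas as pd
from scipy.stats import chi2_contingency, kruskal, mannwhitneyu

# Hypothetical input: one row per chatbot x question x attempt.
responses = pd.read_csv("graded_responses.csv")

# A question counts as "correct" only if all 3 attempts were correct.
per_question = (
    responses.groupby(["chatbot", "question_id"])["is_correct"]
    .all()
    .reset_index(name="correct_all_3")
)
accuracy = per_question.groupby("chatbot")["correct_all_3"].mean() * 100
print(accuracy.round(1))

# Chi-square test on correct/incorrect counts across chatbots.
counts = pd.crosstab(per_question["chatbot"], per_question["correct_all_3"])
chi2, p_chi, dof, expected = chi2_contingency(counts)
print(f"chi-square p = {p_chi:.4f}")

# Kruskal-Wallis on the ordinal reasoning scores (0, 1, 2) across chatbots,
# followed by pairwise Mann-Whitney U tests.
groups = [g["reasoning_score"].values for _, g in responses.groupby("chatbot")]
h_stat, p_kw = kruskal(*groups)
print(f"Kruskal-Wallis p = {p_kw:.4f}")

bots = sorted(responses["chatbot"].unique())
for i in range(len(bots)):
    for j in range(i + 1, len(bots)):
        a = responses.loc[responses["chatbot"] == bots[i], "reasoning_score"]
        b = responses.loc[responses["chatbot"] == bots[j], "reasoning_score"]
        u_stat, p_mw = mannwhitneyu(a, b)
        print(f"{bots[i]} vs {bots[j]}: Mann-Whitney p = {p_mw:.4f}")
```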
Results
The accuracy scores ranged from 48% (Microsoft Copilot) to 71% (Gemini Advanced, GPT-3.5, and Claude 3.5 Sonnet) (P < .05). Gemini Advanced, Gemini, and Microsoft Copilot showed similar performance regardless of the question source (textbook or literature) (P > .05). GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet performed significantly better on textbook-based questions (P < .05). Reasoning scores showed different distributions among the chatbots (P < .05). Gemini Advanced had the highest rate of score 2 (81%) and the lowest rate of score 0 (18.5%).
Conclusions
A comprehensive assessment of the seven AI chatbots' performance on board-style endodontic questions revealed their capacities and limitations as educational resources in the field of endodontics.
About the Journal
The Journal of Endodontics, the official journal of the American Association of Endodontists, publishes scientific articles, case reports and comparison studies evaluating materials and methods of pulp conservation and endodontic treatment. Endodontists and general dentists can learn about new concepts in root canal treatment and the latest advances in techniques and instrumentation in the one journal that helps them keep pace with rapid changes in this field.