Evaluation of the performance of large language models in clinical decision-making in endodontics.

IF 2.6 · CAS Zone 2 (Medicine) · Q1 DENTISTRY, ORAL SURGERY & MEDICINE
Yağız Özbay, Deniz Erdoğan, Gözde Akbal Dinçer
{"title":"Evaluation of the performance of large language models in clinical decision-making in endodontics.","authors":"Yağız Özbay, Deniz Erdoğan, Gözde Akbal Dinçer","doi":"10.1186/s12903-025-06050-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) chatbots are excellent at generating language. The growing use of generative AI large language models (LLMs) in healthcare and dentistry, including endodontics, raises questions about their accuracy. The potential of LLMs to assist clinicians' decision-making processes in endodontics is worth evaluating. This study aims to comparatively evaluate the answers provided by Google Bard, ChatGPT-3.5, and ChatGPT-4 to clinically relevant questions from the field of Endodontics.</p><p><strong>Methods: </strong>40 open-ended questions covering different areas of endodontics were prepared and were introduced to Google Bard, ChatGPT-3.5, and ChatGPT-4. Validity of the questions was evaluated using the Lawshe Content Validity Index. Two experienced endodontists, blinded to the chatbots, evaluated the answers using a 3-point Likert scale. All responses deemed to contain factually wrong information were noted and a misinformation rate for each LLM was calculated (number of answers containing wrong information/total number of questions). The One-way analysis of variance and Post Hoc Tukey test were used to analyze the data and significance was considered to be p < 0.05.</p><p><strong>Results: </strong>ChatGPT-4 demonstrated the highest score and the lowest misinformation rate (P = 0.008) followed by ChatGPT-3.5 and Google Bard respectively. The difference between ChatGPT-4 and Google Bard was statistically significant (P = 0.004).</p><p><strong>Conclusion: </strong>ChatGPT-4 provided more accurate and informative information in endodontics. However, all LLMs produced varying levels of incomplete or incorrect answers.</p>","PeriodicalId":9072,"journal":{"name":"BMC Oral Health","volume":"25 1","pages":"648"},"PeriodicalIF":2.6000,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12039063/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Oral Health","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12903-025-06050-x","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract

Background: Artificial intelligence (AI) chatbots excel at generating fluent language. The growing use of generative large language models (LLMs) in healthcare and dentistry, including endodontics, raises questions about their accuracy. The potential of LLMs to assist clinicians' decision-making in endodontics is therefore worth evaluating. This study comparatively evaluates the answers provided by Google Bard, ChatGPT-3.5, and ChatGPT-4 to clinically relevant questions from the field of endodontics.

Methods: Forty open-ended questions covering different areas of endodontics were prepared and presented to Google Bard, ChatGPT-3.5, and ChatGPT-4. The validity of the questions was assessed using the Lawshe Content Validity Index. Two experienced endodontists, blinded to which chatbot produced each answer, rated the answers on a 3-point Likert scale. All responses deemed to contain factually wrong information were noted, and a misinformation rate was calculated for each LLM (number of answers containing wrong information / total number of questions). One-way analysis of variance (ANOVA) and the post hoc Tukey test were used to analyze the data; significance was set at p < 0.05.
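For readers who want to see how these pieces fit together, below is a minimal sketch in Python (using scipy and statsmodels) of the Lawshe content validity ratio, the per-model misinformation rate, and the ANOVA-plus-Tukey comparison. The panel size, rater scores, and error counts in the sketch are hypothetical placeholders, not the study's data, and this is not the authors' analysis script.

```python
# Sketch of the analysis pipeline described in the Methods.
# All numbers below are hypothetical placeholders.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

N_QUESTIONS = 40

def lawshe_cvr(n_essential: int, n_panelists: int) -> float:
    """Lawshe content validity ratio for one question:
    CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists
    rating the item 'essential' out of N panelists."""
    return (n_essential - n_panelists / 2) / (n_panelists / 2)

def misinformation_rate(n_wrong: int, n_total: int = N_QUESTIONS) -> float:
    """Answers flagged as factually wrong divided by total questions."""
    return n_wrong / n_total

# e.g., 7 of 8 hypothetical panelists rate a question 'essential':
print(f"CVR = {lawshe_cvr(7, 8):.2f}")                # 0.75
# e.g., 5 of 40 answers flagged as containing wrong information:
print(f"misinformation rate = {misinformation_rate(5):.3f}")  # 0.125

# Hypothetical 3-point Likert scores (1 = incorrect, 3 = fully correct)
# for each model's 40 answers; the paper's raw scores are not reproduced.
rng = np.random.default_rng(0)
scores = {
    "ChatGPT-4": rng.integers(2, 4, N_QUESTIONS),
    "ChatGPT-3.5": rng.integers(1, 4, N_QUESTIONS),
    "Google Bard": rng.integers(1, 3, N_QUESTIONS),
}

# One-way ANOVA across the three models, then post hoc Tukey HSD
# for the pairwise comparisons (e.g., ChatGPT-4 vs. Google Bard).
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

groups = np.concatenate([np.full(N_QUESTIONS, name) for name in scores])
values = np.concatenate(list(scores.values()))
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```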

Results: ChatGPT-4 achieved the highest score and the lowest misinformation rate (p = 0.008), followed by ChatGPT-3.5 and Google Bard, respectively. The difference between ChatGPT-4 and Google Bard was statistically significant (p = 0.004).

Conclusion: ChatGPT-4 provided the most accurate and informative answers in endodontics. However, all three LLMs produced incomplete or incorrect answers to varying degrees.

Source journal: BMC Oral Health (DENTISTRY, ORAL SURGERY & MEDICINE)
CiteScore: 3.90 · Self-citation rate: 6.90% · Annual publication volume: 481 articles · Review time: 6-12 weeks
Journal description: BMC Oral Health is an open access, peer-reviewed journal that considers articles on all aspects of the prevention, diagnosis and management of disorders of the mouth, teeth and gums, as well as related molecular genetics, pathophysiology, and epidemiology.