Nicolás Dufey-Portilla, Ana Billik Frisman, Maximiliano Gallardo Robles, Fernando Peña-Bengoa, Consuelo Cabrera Ávila, Venkateshbabu Nagendrababu, Paul M H Dummer, Marc Garcia-Font, Francesc Abella Sans
{"title":"Assessing the validity of ChatGPT-4o and Google Gemini Advanced when responding to frequently asked questions in endodontics.","authors":"Nicolás Dufey-Portilla, Ana Billik Frisman, Maximiliano Gallardo Robles, Fernando Peña-Bengoa, Consuelo Cabrera Ávila, Venkateshbabu Nagendrababu, Paul M H Dummer, Marc Garcia-Font, Francesc Abella Sans","doi":"10.1590/1678-7757-2025-0321","DOIUrl":null,"url":null,"abstract":"<p><p>Artificial intelligence (AI) is transforming access to dental information via large language models (LLMs) such as ChatGPT and Google Gemini. Both models are increasingly being used in endodontics as a source of information for patients. Therefore, as developers release new versions, the validity of their responses must be continuously compared to professional consultations.</p><p><strong>Objective: </strong>This study aimed to evaluate the validity of the responses provided by the most advanced LLMs [Google Gemini Advanced (GGA) and ChatGPT-4o] to frequently asked questions (FAQs) in endodontics.</p><p><strong>Methodology: </strong>A cross-sectional analytical study was conducted in five phases. The top 20 endodontic FAQs submitted by users to chatbots and collected from Google Trends were compiled. In total, nine academically certified endodontic specialists with educational roles scored GGA and ChatGPT-4o responses to the FAQs using a five-point Likert scale. Validity was determined using high (4.5-5) and low (≥4) thresholds. The Fisher's exact test was used for comparative analysis.</p><p><strong>Results: </strong>At the low threshold, both models obtained 95% validity (95% CI: 75.1%- 99.9%; p=.05). At the high threshold, ChatGPT-4o achieved 35% (95% CI: 15.4%- 59.2%) and GGA, 40% (95% CI: 19.1%- 63.9%) validity (p=1).</p><p><strong>Conclusions: </strong>ChatGPT-4o and GGA responses showed high validity under lenient criteria that significantly decreased under stricter thresholds, limiting their reliability as a stand-alone source of information in endodontics. While AI chatbots show promise to improve patient education in endodontics, their validity limitations under rigorous evaluation highlight the need for careful professional monitoring.</p>","PeriodicalId":15133,"journal":{"name":"Journal of Applied Oral Science","volume":"33 ","pages":"e20250321"},"PeriodicalIF":2.6000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Applied Oral Science","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1590/1678-7757-2025-0321","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0
Abstract
Artificial intelligence (AI) is transforming access to dental information via large language models (LLMs) such as ChatGPT and Google Gemini. Both models are increasingly used in endodontics as a source of information for patients. Therefore, as developers release new versions, the validity of their responses must be continuously compared with the information provided in professional consultations.
Objective: This study aimed to evaluate the validity of the responses provided by the most advanced LLMs [Google Gemini Advanced (GGA) and ChatGPT-4o] to frequently asked questions (FAQs) in endodontics.
Methodology: A cross-sectional analytical study was conducted in five phases. The 20 most frequently asked endodontic questions submitted by users to chatbots, collected from Google Trends, were compiled. Nine academically certified endodontic specialists with educational roles scored the GGA and ChatGPT-4o responses to the FAQs using a five-point Likert scale. Validity was determined using high (4.5-5) and low (≥4) score thresholds. Fisher's exact test was used for comparative analysis.
Results: At the low threshold, both models obtained 95% validity (95% CI: 75.1%-99.9%; p=0.05). At the high threshold, ChatGPT-4o achieved 35% (95% CI: 15.4%-59.2%) and GGA, 40% (95% CI: 19.1%-63.9%) validity (p=1).
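The reported intervals are consistent with exact (Clopper-Pearson) binomial confidence intervals over 20 questions per model. As an illustration only (not the authors' code), the Python sketch below back-calculates the high-threshold counts of "valid" responses from the reported percentages (7/20 for ChatGPT-4o and 8/20 for GGA, an assumption) and reproduces the confidence intervals and Fisher's exact test.

```python
# A minimal sketch, assuming exact (Clopper-Pearson) binomial intervals
# and counts back-calculated from the reported percentages; these are
# assumptions, not the authors' raw data or analysis code.
from scipy.stats import beta, fisher_exact

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

n = 20                                 # 20 FAQs scored per model
valid = {"ChatGPT-4o": 7, "GGA": 8}    # high-threshold counts (assumed)

for model, k in valid.items():
    lo, hi = clopper_pearson(k, n)
    print(f"{model}: {k/n:.0%} valid, 95% CI {lo:.1%}-{hi:.1%}")

# 2x2 contingency table (valid vs. not valid) at the high threshold
_, p = fisher_exact([[7, 13], [8, 12]])
print(f"Fisher's exact test p = {p:.2f}")
```

Run as written, this sketch yields 15.4%-59.2% and 19.1%-63.9% for the two models and p=1, matching the reported high-threshold results.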
Conclusions: ChatGPT-4o and GGA responses showed high validity under lenient criteria, but validity decreased significantly under stricter thresholds, limiting their reliability as a stand-alone source of information in endodontics. While AI chatbots show promise for improving patient education in endodontics, their limited validity under rigorous evaluation highlights the need for careful professional monitoring.
Journal Introduction
The Journal of Applied Oral Science is committed to publishing the scientific and technological advances achieved by the dental community, in accordance with quality indicators and peer review, with the objective of ensuring acceptance at the local, regional, national, and international levels. The primary goal of the Journal of Applied Oral Science is to publish the outcomes of original investigations, as well as invited case reports and invited reviews, in the field of Dentistry and related areas.