João Daniel Mendonça de Moura, Carlos Eduardo Fontana, Vitor Henrique Reis da Silva Lima, Iris de Souza Alves, Paulo André de Melo Santos, Patrícia de Almeida Rodrigues
Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: A cross-sectional study
Computers in Biology and Medicine, vol. 183, Article 109332. Published 2024-10-30. DOI: 10.1016/j.compbiomed.2024.109332
https://www.sciencedirect.com/science/article/pii/S0010482524014173
Citations: 0
Abstract
Objectives
This study aimed to evaluate the diagnostic accuracy and treatment recommendation performance of four artificial intelligence chatbots in fictional pulpal and periradicular disease cases. Additionally, it investigated response consistency and the influence of text order and language on chatbot performance.
Methods
In this cross-sectional comparative study, eleven cases representing various pulpal and periradicular pathologies were created. These cases were presented to four chatbots (ChatGPT 3.5, ChatGPT 4.0, Bard, and Bing) in both Portuguese and English, with the information order varied (signs and symptoms first or imaging data first). Statistical analyses included the Kruskal-Wallis test, Dwass-Steel-Critchlow-Fligner pairwise comparisons, simple logistic regression, and the binomial test.
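The presentation design described above can be sketched as a full factorial enumeration. This is only an illustration under stated assumptions: the factor levels come from the Methods text, but the order labels, the numeric case IDs, and the function name `build_conditions` are hypothetical, and the count excludes any repeated submissions the authors may have used to assess consistency.

```python
from itertools import product

# Factor levels taken from the Methods text.
CHATBOTS = ["ChatGPT 3.5", "ChatGPT 4.0", "Bard", "Bing"]
LANGUAGES = ["Portuguese", "English"]
# Label strings are hypothetical; the source only names the two orderings.
ORDERS = ["signs_and_symptoms_first", "imaging_first"]
N_CASES = 11


def build_conditions():
    """Return every distinct (chatbot, language, order, case_id) combination."""
    case_ids = range(1, N_CASES + 1)  # numeric IDs are placeholders, not from the paper
    return list(product(CHATBOTS, LANGUAGES, ORDERS, case_ids))


conditions = build_conditions()
print(len(conditions))  # 4 chatbots * 2 languages * 2 orders * 11 cases
```

Under these assumptions each chatbot sees 44 distinct prompts (11 cases in two languages and two information orders), which is the unit over which per-chatbot accuracy rates would be computed.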
Results
Bing and ChatGPT 4.0 achieved the highest diagnostic accuracy rates (86.4% and 85.3%, respectively), significantly outperforming ChatGPT 3.5 (46.5%) and Bard (28.6%) (p < 0.001). For treatment recommendations, ChatGPT 4.0, Bing, and ChatGPT 3.5 performed similarly (94.4%, 93.2%, and 86.3%, respectively), while Bard exhibited significantly lower accuracy (75%, p < 0.001). No significant association between diagnosis and treatment accuracy was found for Bard and Bing, but a positive association was observed for ChatGPT 3.5 and ChatGPT 4.0 (p < 0.05). The overall consistency rate was 98.29%, with no significant differences related to text order or language. Cases presented in Portuguese prompted significantly more additional information requests than those in English (33.5% vs. 10.2%; p < 0.001), with the relevance of this information being higher in Portuguese (29.5% vs. 8.5%; p < 0.001).
Conclusions
Bing and ChatGPT 4.0 demonstrated superior diagnostic accuracy, while Bard showed the lowest accuracy in both diagnosis and treatment recommendations. However, the clinical application of these tools necessitates critical interpretation by dentists, as chatbot responses are not consistently reliable.
About the journal:
Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.