{"title":"Evaluating the validity and consistency of artificial intelligence chatbots in responding to patients' frequently asked questions in prosthodontics.","authors":"Maryam Gheisarifar, Marwa Shembesh, Merve Koseoglu, Qiao Fang, Fatemeh Solmaz Afshari, Judy Chia-Chun Yuan, Cortino Sukotjo","doi":"10.1016/j.prosdent.2025.03.009","DOIUrl":null,"url":null,"abstract":"<p><strong>Statement of problem: </strong>Healthcare-related information provided by artificial intelligence (AI) chatbots may pose challenges such as inaccuracies, lack of empathy, biases, over-reliance, limited scope, and ethical concerns.</p><p><strong>Purpose: </strong>The purpose of this study was to evaluate and compare the validity and consistency of responses to prosthodontics-related frequently asked questions (FAQ) generated by 4 different chatbot systems.</p><p><strong>Material and methods: </strong>Four prosthodontics domains were evaluated: implant, fixed prosthodontics, complete denture (CD), and removable partial denture (RPD). Within each domain, 10 questions were prepared by full-time prosthodontic faculty members, and 10 questions were generated by GPT-3.5, representing its top frequently asked questions in each domain. The validity and consistency of responses provided by 4 chatbots: GPT-3.5, GPT-4, Gemini, and Bing were evaluated. The chi-squared test with the Yates correction was used to compare the validity of responses between different chatbots (α=.05). The Cronbach alpha was calculated for 3 sets of responses collected in the morning, afternoon, and evening to evaluate the consistency of the responses.</p><p><strong>Results: </strong>According to the low threshold validity test, the chatbots' answers to ChatGPT's implant-related, ChatGPT's RPD-related, and prosthodontists' CD-related FAQs were statistically different (P<.001, P<.001, and P=.004, respectively), with Bing being the lowest. At the high threshold validity test, the chatbots' answers to ChatGPT's implant-related and RPD-related FAQs and ChatGPT's and prosthodontists' fixed prosthetics-related and CD-related FAQs were statistically different (P<.001, P<.001, P=.004, P=.002, and P=.003, respectively), with Bing being the lowest. Overall, all 4 chatbots demonstrated lower validity at the high threshold than the low threshold. Bing, Gemini, and ChatGPT-4 chatbots displayed an acceptable level of consistency, while ChatGPT-3.5 did not.</p><p><strong>Conclusions: </strong>Currently, AI chatbots show limitations in delivering answers to patients' prosthodontic-related FAQs with high validity and consistency.</p>","PeriodicalId":16866,"journal":{"name":"Journal of Prosthetic Dentistry","volume":" ","pages":""},"PeriodicalIF":4.3000,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Prosthetic Dentistry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.prosdent.2025.03.009","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Abstract
Statement of problem: Healthcare-related information provided by artificial intelligence (AI) chatbots may pose challenges such as inaccuracies, lack of empathy, biases, over-reliance, limited scope, and ethical concerns.
Purpose: The purpose of this study was to evaluate and compare the validity and consistency of responses to prosthodontics-related frequently asked questions (FAQs) generated by 4 different chatbot systems.
Material and methods: Four prosthodontics domains were evaluated: implant, fixed prosthodontics, complete denture (CD), and removable partial denture (RPD). Within each domain, 10 questions were prepared by full-time prosthodontic faculty members, and 10 questions were generated by GPT-3.5, representing its top frequently asked questions in each domain. The validity and consistency of responses provided by 4 chatbots (GPT-3.5, GPT-4, Gemini, and Bing) were evaluated. The chi-squared test with the Yates correction was used to compare the validity of responses between different chatbots (α=.05). The Cronbach alpha was calculated for 3 sets of responses collected in the morning, afternoon, and evening to evaluate the consistency of the responses.
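The methods name two statistical procedures: a chi-squared test with the Yates continuity correction for between-chatbot validity comparisons, and the Cronbach alpha for consistency across the 3 daily response sets. The Python sketch below illustrates both procedures; all counts and ratings in it are hypothetical, not the study's data. Note that SciPy applies the Yates correction only to 2x2 tables (1 degree of freedom), so the example compares 2 chatbots at a time.

# Illustrative sketch only: the study's data are not published here, so every
# number below is hypothetical. The sketch demonstrates the two procedures the
# methods describe: a chi-squared test with the Yates continuity correction,
# and the Cronbach alpha across 3 repeated response sets.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 validity table for one FAQ domain, comparing two chatbots.
counts = np.array([
    [10, 4],   # responses rated valid:     chatbot A, chatbot B
    [ 0, 6],   # responses rated not valid: chatbot A, chatbot B
])
# correction=True applies the Yates continuity correction (dof == 1 here).
chi2, p, dof, expected = chi2_contingency(counts, correction=True)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, significant at alpha=.05: {p < .05}")

def cronbach_alpha(scores):
    """Cronbach alpha for an (n_questions x n_repetitions) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # repetitions (morning, afternoon, evening)
    item_vars = scores.var(axis=0, ddof=1)      # variance of each repetition
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of per-question totals
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical validity ratings (0-2) for 10 questions scored at 3 times of day.
ratings = np.array([
    [2, 2, 2], [1, 1, 2], [0, 0, 0], [2, 1, 2], [1, 1, 1],
    [2, 2, 1], [0, 1, 0], [1, 1, 1], [2, 2, 2], [0, 0, 1],
])
# A common rule of thumb treats alpha >= .70 as acceptable consistency.
print(f"Cronbach alpha = {cronbach_alpha(ratings):.3f}")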
Results: According to the low-threshold validity test, the chatbots' answers to ChatGPT's implant-related, ChatGPT's RPD-related, and prosthodontists' CD-related FAQs were statistically different (P<.001, P<.001, and P=.004, respectively), with Bing scoring the lowest. At the high-threshold validity test, the chatbots' answers to ChatGPT's implant-related and RPD-related FAQs and to ChatGPT's and prosthodontists' fixed prosthodontics-related and CD-related FAQs were statistically different (P<.001, P<.001, P=.004, P=.002, and P=.003, respectively), with Bing again scoring the lowest. Overall, all 4 chatbots demonstrated lower validity at the high threshold than at the low threshold. Bing, Gemini, and ChatGPT-4 displayed an acceptable level of consistency, while ChatGPT-3.5 did not.
Conclusions: Currently, AI chatbots show limitations in delivering answers to patients' prosthodontics-related FAQs with high validity and consistency.
About the Journal
The Journal of Prosthetic Dentistry is the leading professional journal devoted exclusively to prosthetic and restorative dentistry. The Journal is the official publication for 24 leading U.S. and international prosthodontic organizations. The monthly publication features timely, original peer-reviewed articles on the newest techniques, dental materials, and research findings. The Journal serves prosthodontists and dentists in advanced practice, and features color photos that illustrate many step-by-step procedures. The Journal of Prosthetic Dentistry is included in Index Medicus and CINAHL.