A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis

Natalie D. Cohen MD, Milan Ho BS, Donald McIntire PhD, Katherine Smith MD, Kimberly A. Kho MD

AJOG Global Reports, Volume 5, Issue 1, Article 100405, February 2025
DOI: 10.1016/j.xagr.2024.100405
Full text: https://www.sciencedirect.com/science/article/pii/S2666577824000996
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730533/pdf/
Abstract
Introduction
The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using large language model (LLM) chatbots as an educational resource. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients, and to determine whether there is variability among them.
Objective
This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them.
Study Design
Three LLMs, ChatGPT-4 (OpenAI), Claude (Anthropic), and Bard (Google), were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared with current guidelines and expert opinion on endometriosis and rated by nine gynecologists on the following scale: (1) completely incorrect; (2) mostly incorrect, some correct; (3) mostly correct, some incorrect; (4) correct but inadequate; (5) correct and comprehensive. Final scores were averaged across the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the strength of the reviewers' agreement in ranking the LLMs' responses for each item.
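To illustrate the agreement statistic used here, the following is a minimal Python sketch of Kendall's W and its related chi-square test for a single question, using hypothetical ratings (the study's per-reviewer data are not reproduced in this abstract). The simple formula below omits the correction term usually applied when rankings contain ties, so it slightly understates W for tied data.

```python
import numpy as np
from scipy.stats import rankdata, chi2

# Hypothetical 1-5 ratings for one question: 9 reviewers (rows) x 3 chatbots (columns).
ratings = np.array([
    [4, 5, 4],
    [3, 4, 4],
    [4, 5, 3],
    [3, 4, 4],
    [4, 4, 3],
    [3, 5, 4],
    [4, 4, 4],
    [3, 5, 3],
    [4, 4, 3],
])

m, n = ratings.shape                                # m raters, n items being ranked
ranks = np.apply_along_axis(rankdata, 1, ratings)   # rank within each rater; ties get average ranks
rank_sums = ranks.sum(axis=0)                       # column rank sums, one per chatbot
S = ((rank_sums - rank_sums.mean()) ** 2).sum()     # squared deviations from the mean rank sum

# Kendall's W = 12S / (m^2 (n^3 - n)); 0 = no agreement, 1 = perfect agreement.
W = 12 * S / (m**2 * (n**3 - n))

# Related chi-square test: chi2 = m(n - 1)W with n - 1 degrees of freedom,
# testing whether the observed agreement exceeds chance.
chi_sq = m * (n - 1) * W
p_value = chi2.sf(chi_sq, df=n - 1)
print(f"Kendall's W = {W:.3f}, chi-square = {chi_sq:.2f}, p = {p_value:.4f}")
```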
Results
Average scores across the 10 answers for Bard, ChatGPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement among the nine reviewers. No question was answered both correctly and comprehensively across all reviewers. The model most associated with comprehensive and correct responses was ChatGPT. The chatbots answered questions about symptoms and pathophysiology more accurately than questions about treatment and risk of recurrence.
Conclusion
The analysis revealed that, on average, the LLMs mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, their outputs must be evaluated thoroughly and on an ongoing basis so that patients receive the most comprehensive and accurate information. Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.