A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis

Natalie D. Cohen MD, Milan Ho BS, Donald McIntire PhD, Katherine Smith MD, Kimberly A. Kho MD
{"title":"领先聊天机器人对子宫内膜异位症问题的生成式人工智能反应的比较分析。","authors":"Natalie D. Cohen MD,&nbsp;Milan Ho BS,&nbsp;Donald McIntire PhD,&nbsp;Katherine Smith MD,&nbsp;Kimberly A. Kho MD","doi":"10.1016/j.xagr.2024.100405","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them.</div></div><div><h3>Objective</h3><div>This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them.</div></div><div><h3>Study Design</h3><div>Three LLMs, including Chat GPT-4 (Open AI), Claude (Anthropic), and Bard (Google) were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared to current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale included the following: (1) Completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged between the nine reviewers. Kendall's <em>W</em> and the related chi-square test were used to evaluate the reviewers’ strength of agreement in ranking the LLMs’ responses for each item.</div></div><div><h3>Results</h3><div>Average scores for the 10 answers amongst Bard, Chat GPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement between the nine reviewers. There were no questions the models could answer comprehensively or correctly across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. Chatbots showed an improved ability to accurately answer questions about symptoms and pathophysiology over treatment and risk of recurrence.</div></div><div><h3>Conclusion</h3><div>The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain a thorough ongoing evaluation process for outputs to provide the most comprehensive and accurate information to patients. Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.</div></div>","PeriodicalId":72141,"journal":{"name":"AJOG global reports","volume":"5 1","pages":"Article 100405"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730533/pdf/","citationCount":"0","resultStr":"{\"title\":\"A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis\",\"authors\":\"Natalie D. Cohen MD,&nbsp;Milan Ho BS,&nbsp;Donald McIntire PhD,&nbsp;Katherine Smith MD,&nbsp;Kimberly A. 
Kho MD\",\"doi\":\"10.1016/j.xagr.2024.100405\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Introduction</h3><div>The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them.</div></div><div><h3>Objective</h3><div>This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them.</div></div><div><h3>Study Design</h3><div>Three LLMs, including Chat GPT-4 (Open AI), Claude (Anthropic), and Bard (Google) were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared to current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale included the following: (1) Completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged between the nine reviewers. Kendall's <em>W</em> and the related chi-square test were used to evaluate the reviewers’ strength of agreement in ranking the LLMs’ responses for each item.</div></div><div><h3>Results</h3><div>Average scores for the 10 answers amongst Bard, Chat GPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement between the nine reviewers. There were no questions the models could answer comprehensively or correctly across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. Chatbots showed an improved ability to accurately answer questions about symptoms and pathophysiology over treatment and risk of recurrence.</div></div><div><h3>Conclusion</h3><div>The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain a thorough ongoing evaluation process for outputs to provide the most comprehensive and accurate information to patients. 
Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.</div></div>\",\"PeriodicalId\":72141,\"journal\":{\"name\":\"AJOG global reports\",\"volume\":\"5 1\",\"pages\":\"Article 100405\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730533/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AJOG global reports\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666577824000996\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AJOG global reports","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666577824000996","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract


Introduction

The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using large language model (LLM) chatbots as a source of health education. As healthcare information technology evolves, it is imperative to evaluate chatbots, assess the accuracy of the information they provide to patients, and determine whether that information varies between chatbots.

Objective

This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and to determine the level of variability among them.

Study Design

Three LLMs, ChatGPT-4 (OpenAI), Claude (Anthropic), and Bard (Google), were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared with current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale was as follows: (1) completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged across the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the strength of the reviewers' agreement in ranking the LLMs' responses for each item.
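For readers unfamiliar with the agreement statistic, the sketch below computes Kendall's W and its large-sample chi-square test for one question's score matrix. This is a minimal illustration, not the study's analysis code: the scores, the helper name kendalls_w, and the use of Python with NumPy/SciPy are all assumptions, and the formula omits the correction for tied ranks (which would raise W slightly when reviewers assign equal scores).

```python
import numpy as np
from scipy.stats import chi2, rankdata

def kendalls_w(ratings):
    """Kendall's W for an (m raters x n items) score matrix.

    Each rater's scores are converted to ranks across the n items;
    W ranges from 0 (no agreement) to 1 (perfect agreement).
    Tie correction is omitted for brevity.
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # ties -> average ranks
    col_sums = ranks.sum(axis=0)                       # total rank per item
    s = ((col_sums - col_sums.mean()) ** 2).sum()      # spread of rank sums
    w = 12.0 * s / (m**2 * (n**3 - n))
    # Large-sample chi-square test of the null "no agreement", df = n - 1.
    chi_sq = m * (n - 1) * w
    return w, chi2.sf(chi_sq, df=n - 1)

# Hypothetical data: nine reviewers scoring the three chatbots' answers to
# one question on the study's 1-5 scale (illustrative numbers only).
scores = [
    [3, 5, 4], [4, 4, 3], [3, 5, 4],
    [4, 5, 3], [3, 4, 4], [4, 5, 3],
    [3, 4, 4], [4, 5, 3], [3, 5, 4],
]
w, p = kendalls_w(scores)
print(f"Kendall's W = {w:.2f} (chi-square p = {p:.3f})")
```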

Results

Average scores across the 10 answers were 3.69, 4.24, and 3.70 for Bard, ChatGPT, and Claude, respectively. Two questions showed significant disagreement among the nine reviewers. No question was answered both comprehensively and correctly by the models across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. The chatbots answered questions about symptoms and pathophysiology more accurately than questions about treatment and risk of recurrence.

Conclusion

The analysis revealed that, on average, the LLMs mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain thorough, ongoing evaluation of their outputs so that patients receive the most comprehensive and accurate information. Further research into this technology and its role in patient education and treatment is essential as generative AI becomes more embedded in the medical field.