{"title":"The Emerging Role of AI in Patient Education: A Comparative Analysis of LLM Accuracy for Pelvic Organ Prolapse.","authors":"Sakine Rahimli Ocakoglu, Burhan Coskun","doi":"10.1159/000538538","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three Large Language Models (LLMs): GPT by OpenAI, BARD by Google, and Bing by Microsoft, in comparison to patient education material on Pelvic Organ Prolapse (POP) provided by the Royal College of Obstetricians and Gynecologists (RCOG).</p><p><strong>Methods: </strong>A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simplified Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score.</p><p><strong>Results: </strong>Significant differences were observed in completeness and precision metrics. ChatGPT ranked highest in completeness (66.7%), while Bing led in precision (100%). No significant differences were observed in accuracy across all models. In terms of readability, ChatGPT exhibited higher difficulty than BARD, Bing, and the original RCOG answers.</p><p><strong>Conclusion: </strong>While all models displayed a variable degree of correctness, ChatGPT excelled in completeness, significantly surpassing BARD and Bing. However, Bing led in precision, providing the most relevant and concise answers. Regarding readability, ChatGPT exhibited higher difficulty. The study found that while all LLMs showed varying degrees of correctness in answering RCOG questions on patient information for Pelvic Organ Prolapse (POP), ChatGPT was the most comprehensive, but its answers were harder to read. Bing, on the other hand, was the most precise. The findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.</p>","PeriodicalId":18455,"journal":{"name":"Medical Principles and Practice","volume":" ","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11324208/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Principles and Practice","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1159/000538538","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0
Abstract
Objective: This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three Large Language Models (LLMs): GPT by OpenAI, BARD by Google, and Bing by Microsoft, in comparison to the patient education material on Pelvic Organ Prolapse (POP) provided by the Royal College of Obstetricians and Gynaecologists (RCOG).
Methods: A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simplified Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score.
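The abstract names the standard SMOG and Flesch-Kincaid Grade Level readability measures but, as an abstract, gives no computational detail. The sketch below is a minimal Python illustration of those two published formulas, using a naive vowel-group syllable heuristic and simple sentence splitting; both are assumptions for illustration, not the authors' scoring method. Note that SMOG is defined for samples of about 30 sentences, so values on short passages are indicative only.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Naive vowel-group syllable heuristic (an assumption; the paper
    does not specify how syllables were counted)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # drop a silent trailing 'e'
    return max(count, 1)


def readability_scores(text: str) -> dict:
    """Compute FKGL and SMOG from raw text using the standard formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]

    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    n_syll = sum(syllables)
    polysyllables = sum(1 for s in syllables if s >= 3)

    # Flesch-Kincaid Grade Level
    fkgl = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
    # Simplified Measure of Gobbledygook (SMOG)
    smog = 1.0430 * math.sqrt(polysyllables * (30 / n_sent)) + 3.1291

    return {"FKGL": round(fkgl, 2), "SMOG": round(smog, 2)}


print(readability_scores(
    "Pelvic organ prolapse happens when the pelvic organs bulge into the vagina. "
    "It is common and can often be managed with pelvic floor exercises."
))
```

Both scores map text to an approximate US school grade level, so a higher value indicates harder reading, which is how the comparison between the LLM outputs and the RCOG material is framed in the Results.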
Results: Significant differences were observed in the completeness and precision metrics. ChatGPT ranked highest in completeness (66.7%), while Bing led in precision (100%). No significant differences in accuracy were observed across the models. In terms of readability, ChatGPT's answers were more difficult to read than those of BARD, Bing, and the original RCOG material.
Conclusion: All models displayed a variable degree of correctness in answering the RCOG patient-information questions on Pelvic Organ Prolapse (POP). ChatGPT excelled in completeness, significantly surpassing BARD and Bing, but its answers were the hardest to read; Bing led in precision, providing the most relevant and concise answers. The findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.
About the journal:
Medical Principles and Practice, the journal of the Health Sciences Centre, Kuwait University, aims to be a publication of international repute and a medium for the dissemination and exchange of scientific knowledge in the health sciences.