{"title":"The Emerging Role of AI in Patient Education: A Comparative Analysis of LLM Accuracy for Pelvic Organ Prolapse.","authors":"Sakine Rahimli Ocakoglu, Burhan Coskun","doi":"10.1159/000538538","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three Large Language Models (LLMs): GPT by OpenAI, BARD by Google, and Bing by Microsoft, in comparison to patient education material on Pelvic Organ Prolapse (POP) provided by the Royal College of Obstetricians and Gynecologists (RCOG).</p><p><strong>Methods: </strong>A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simplified Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score.</p><p><strong>Results: </strong>Significant differences were observed in completeness and precision metrics. ChatGPT ranked highest in completeness (66.7%), while Bing led in precision (100%). No significant differences were observed in accuracy across all models. In terms of readability, ChatGPT exhibited higher difficulty than BARD, Bing, and the original RCOG answers.</p><p><strong>Conclusion: </strong>While all models displayed a variable degree of correctness, ChatGPT excelled in completeness, significantly surpassing BARD and Bing. However, Bing led in precision, providing the most relevant and concise answers. Regarding readability, ChatGPT exhibited higher difficulty. The study found that while all LLMs showed varying degrees of correctness in answering RCOG questions on patient information for Pelvic Organ Prolapse (POP), ChatGPT was the most comprehensive, but its answers were harder to read. Bing, on the other hand, was the most precise. The findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.</p>","PeriodicalId":18455,"journal":{"name":"Medical Principles and Practice","volume":" ","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11324208/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Principles and Practice","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1159/000538538","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0
Abstract
Objective: This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three Large Language Models (LLMs): GPT by OpenAI, BARD by Google, and Bing by Microsoft, in comparison to the patient education material on Pelvic Organ Prolapse (POP) provided by the Royal College of Obstetricians and Gynaecologists (RCOG).
Methods: A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simplified Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score.
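The abstract names the standard SMOG and Flesch-Kincaid Grade Level readability measures but, as an abstract, gives no computational detail. The sketch below is a minimal Python illustration of those two published formulas, using a naive vowel-group syllable heuristic and simple sentence splitting; both are assumptions for illustration, not the authors' scoring method. Note that SMOG is defined for samples of about 30 sentences, so values on short passages are indicative only.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Naive vowel-group syllable heuristic (an assumption; the paper
    does not specify how syllables were counted)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # drop a silent trailing 'e'
    return max(count, 1)


def readability_scores(text: str) -> dict:
    """Compute FKGL and SMOG from raw text using the standard formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]

    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    n_syll = sum(syllables)
    polysyllables = sum(1 for s in syllables if s >= 3)

    # Flesch-Kincaid Grade Level
    fkgl = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
    # Simplified Measure of Gobbledygook (SMOG)
    smog = 1.0430 * math.sqrt(polysyllables * (30 / n_sent)) + 3.1291

    return {"FKGL": round(fkgl, 2), "SMOG": round(smog, 2)}


print(readability_scores(
    "Pelvic organ prolapse happens when the pelvic organs bulge into the vagina. "
    "It is common and can often be managed with pelvic floor exercises."
))
```

Both scores map text to an approximate US school grade level, so a higher value indicates harder reading, which is how the comparison between the LLM outputs and the RCOG material is framed in the Results.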
Results: Significant differences were observed in the completeness and precision metrics. ChatGPT ranked highest in completeness (66.7%), while Bing led in precision (100%). No significant differences in accuracy were observed across the models. In terms of readability, ChatGPT's answers were more difficult to read than those of BARD, Bing, and the original RCOG material.
Conclusion: All models displayed a variable degree of correctness in answering the RCOG patient-information questions on Pelvic Organ Prolapse (POP). ChatGPT excelled in completeness, significantly surpassing BARD and Bing, but its answers were the hardest to read; Bing led in precision, providing the most relevant and concise answers. The findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.
About the journal:
Medical Principles and Practice, the journal of the Health Sciences Centre, Kuwait University, aims to be a publication of international repute and a medium for the dissemination and exchange of scientific knowledge in the health sciences.