The use of a large language model to create plain language summaries of evidence reviews in healthcare: A feasibility study

Cochrane Evidence Synthesis and Methods. Pub Date: 2024-02-04. DOI: 10.1002/cesm.12041

Colleen Ovelman, Shannon Kugley, Gerald Gartlehner, Meera Viswanathan

Abstract

Introduction

Plain language summaries (PLSs) make complex healthcare evidence accessible to patients and the public. Large language models (LLMs) may assist in generating accurate, readable PLSs. This study explored using the LLM Claude 2 to create PLSs of evidence reviews from the Agency for Healthcare Research and Quality (AHRQ) Effective Health Care Program.

Methods

We selected 10 evidence reviews published from 2021 to 2023, representing a range of methods and topics. We iteratively developed a prompt to guide Claude 2 in creating PLSs; the prompt included specifications for plain language, reading level, length, organizational structure, active voice, and inclusive language. We assessed the PLSs for adherence to prompt specifications, comprehensiveness, accuracy, readability, and cultural sensitivity.

Results

All PLSs met the word count specification. We judged one PLS as fully comprehensive and seven as mostly comprehensive. We judged two PLSs as fully capturing the PICO (population, intervention, comparator, outcome) elements and five as having minor PICO errors. We judged three PLSs as accurately reporting the results, four as having minor result errors, and three as having major result errors because they incorrectly reported total participants. Five PLSs met the target 6th to 8th grade reading level. Passive voice use averaged 16%. All PLSs used inclusive language.

Conclusions

LLMs show promise for assisting in PLS creation but likely require human input to ensure accuracy, comprehensiveness, and the appropriate nuances of interpretation. Iterative prompt refinement may improve results and address the needs of specific reviews and audiences. As text-only summaries, the AI-generated PLSs could not meet all consumer communication criteria, such as textual design and visual representations. Further testing should include consumer reviewers and explore how to best leverage LLM support in drafting PLS text for complex evidence reviews.
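
The word count and reading level checks reported in the Results could be automated along the following lines. This is a minimal sketch, not the authors' method: the abstract does not say which readability formula or tool was used, so the Flesch-Kincaid grade level is assumed here, the syllable counter is a rough heuristic, and the names `count_syllables`, `readability_report`, and `max_words` are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (a rough heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability_report(text: str, max_words: int, grade_range=(6.0, 8.0)) -> dict:
    """Check a summary against a word limit and a target grade range.

    max_words and grade_range are illustrative parameters; the 6th to 8th
    grade default mirrors the target named in the abstract.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid grade level (assumed formula, see lead-in).
    grade = (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )
    low, high = grade_range
    return {
        "word_count": len(words),
        "within_word_limit": len(words) <= max_words,
        "grade_level": round(grade, 2),
        "meets_target_grade": low <= grade <= high,
    }


if __name__ == "__main__":
    sample = (
        "Plain language summaries help people understand health research. "
        "They use short sentences and common words."
    )
    print(readability_report(sample, max_words=400))
```

A heuristic like this only screens drafts. As the Conclusions note, accuracy and comprehensiveness still need human review.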
