ChatGPT-4o Compared With Human Researchers in Writing Plain-Language Summaries for Cochrane Reviews: A Blinded, Randomized Non-Inferiority Controlled Trial

Dagný Halla Ágústsdóttir, Jacob Rosenberg, Jason Joe Baker
{"title":"ChatGPT-4o Compared With Human Researchers in Writing Plain-Language Summaries for Cochrane Reviews: A Blinded, Randomized Non-Inferiority Controlled Trial","authors":"Dagný Halla Ágústsdóttir,&nbsp;Jacob Rosenberg,&nbsp;Jason Joe Baker","doi":"10.1002/cesm.70037","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Introduction</h3>\n \n <p>Plain language summaries in Cochrane reviews are designed to present key information in a way that is understandable to individuals without a medical background. Despite Cochrane's author guidelines, these summaries often fail to achieve their intended purpose. Studies show that they are generally difficult to read and vary in their adherence to the guidelines. Artificial intelligence is increasingly used in medicine and academia, with its potential being tested in various roles. This study aimed to investigate whether ChatGPT-4o could produce plain language summaries that are as good as the already published plain language summaries in Cochrane reviews.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>We conducted a randomized, single-blinded study with a total of 36 plain language summaries: 18 human written and 18 ChatGPT-4o generated summaries where both versions were for the same Cochrane reviews. The sample size was calculated to be 36 and each summary was evaluated four times. Each summary was reviewed twice by members of a Cochrane editorial group and twice by laypersons. The summaries were assessed in three different ways: First, all assessors evaluated the summaries for informativeness, readability, and level of detail using a Likert scale from 1 to 10. They were also asked whether they would submit the summary and whether they could identify who had written it. Second, members of a Cochrane editorial group assessed the summaries using a checklist based on Cochrane's guidelines for plain language summaries, with scores ranging from 0 to 10. 
Finally, the readability of the summaries was analyzed using objective tools such as Lix and Flesch-Kincaid scores. Randomization and allocation to either ChatGPT-4o or human written summaries were conducted using random.org's random sequence generator, and assessors were blinded to the authorship of the summaries.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>The plain language summaries generated by ChatGPT-4o scored 1 point higher on information (<i>p</i> &lt; .001) and level of detail (<i>p</i> = .004), and 2 points higher on readability (<i>p</i> = .002) compared to human written summaries. Lix and Flesch-Kincaid scores were high for both groups of summaries, though ChatGPT was slightly easier to read (<i>p</i> &lt; .001). Assessors found it difficult to distinguish between ChatGPT and human written summaries, with only 20% correctly identifying ChatGPT generated text. ChatGPT summaries were preferred for submission compared to the human written summaries (64% vs. 36%, <i>p</i> &lt; .001).</p>\n </section>\n \n <section>\n \n <h3> Conclusion</h3>\n \n <p>ChatGPT-4o shows promise in creating plain language summaries for Cochrane reviews at least as well as humans and in some cases slightly better. 
This study suggests ChatGPT-4o's could become a tool for drafting easy-to-understand plain language summaries for Cochrane reviews with a quality approaching or matching human authors.</p>\n </section>\n \n <section>\n \n <h3> Clinical Trial Registration and Protocol</h3>\n \n <p>Available at https://osf.io/aq6r5.</p>\n </section>\n </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70037","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cochrane Evidence Synthesis and Methods","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cesm.70037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Introduction

Plain language summaries in Cochrane reviews are designed to present key information in a way that is understandable to individuals without a medical background. Despite Cochrane's author guidelines, these summaries often fail to achieve their intended purpose: studies show that they are generally difficult to read and vary in their adherence to the guidelines. Artificial intelligence is increasingly used in medicine and academia, and its potential is being tested in various roles. This study aimed to investigate whether ChatGPT-4o could produce plain language summaries that are as good as the plain language summaries already published in Cochrane reviews.

Methods

We conducted a randomized, single-blinded study of 36 plain language summaries: 18 human-written and 18 ChatGPT-4o-generated, with both versions covering the same Cochrane reviews. The sample size was calculated to be 36, and each summary was evaluated four times: twice by members of a Cochrane editorial group and twice by laypersons. The summaries were assessed in three ways. First, all assessors rated each summary for informativeness, readability, and level of detail on a Likert scale from 1 to 10; they were also asked whether they would submit the summary and whether they could identify who had written it. Second, members of a Cochrane editorial group scored the summaries against a checklist based on Cochrane's guidelines for plain language summaries, with scores ranging from 0 to 10. Finally, the readability of the summaries was analyzed with objective tools, the Lix and Flesch-Kincaid scores. Randomization and allocation to either the ChatGPT-4o or the human-written summary were conducted with random.org's random sequence generator, and assessors were blinded to the authorship of the summaries.
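The abstract names Lix and Flesch-Kincaid as the objective readability tools but does not reproduce their formulas. A minimal sketch of both measures using their standard published definitions is below; the vowel-group syllable counter is a crude heuristic (production readability tools use pronunciation dictionaries), and none of the function names come from the paper itself.

```python
import re

def _words(text):
    """Tokenize into alphabetic words."""
    return re.findall(r"[A-Za-z]+", text)

def _sentences(text):
    """Count sentence-ending punctuation runs; at least 1."""
    return max(1, len(re.findall(r"[.!?]+", text)))

def lix(text):
    """Lix = words per sentence + percentage of long words (>6 letters)."""
    words = _words(text)
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / _sentences(text) + 100 * long_words / len(words)

def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words = _words(text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / _sentences(text)
            + 11.8 * syllables / len(words) - 15.59)
```

Higher Lix and Flesch-Kincaid grade values indicate harder text, so the "high for both groups" finding in the Results means both sets of summaries remained fairly demanding to read.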

Results

The plain language summaries generated by ChatGPT-4o scored 1 point higher on informativeness (p < .001) and level of detail (p = .004), and 2 points higher on readability (p = .002), than the human-written summaries. Lix and Flesch-Kincaid scores were high for both groups of summaries, though the ChatGPT-4o summaries were slightly easier to read (p < .001). Assessors found it difficult to distinguish ChatGPT-4o-generated from human-written summaries, with only 20% correctly identifying ChatGPT-generated text. ChatGPT-4o summaries were preferred for submission over the human-written summaries (64% vs. 36%, p < .001).
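The abstract does not state which statistical test produced the p-value for the 64% vs. 36% submission preference. As an illustration only, under the assumption of a simple two-sided exact binomial test against an even split, a stdlib-only sketch would be:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Two-sided exact binomial p-value by the small-p-values method:
    sum the probabilities of all outcomes no more likely than the observed one."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # relative tolerance for float comparison
    return sum(x for x in pmf if x <= threshold)
```

The counts here are hypothetical (e.g., 92 preferences out of 144 assessments would be roughly 64%); the paper's actual analysis may well differ, so this is a sketch of the general technique, not a reconstruction of the reported result.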

Conclusion

ChatGPT-4o shows promise in creating plain language summaries for Cochrane reviews that are at least as good as human-written ones and, in some cases, slightly better. This study suggests that ChatGPT-4o could become a tool for drafting easy-to-understand plain language summaries for Cochrane reviews, with quality approaching or matching that of human authors.

Clinical Trial Registration and Protocol

Available at https://osf.io/aq6r5.

