Using a Large Language Model (ChatGPT-4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions: Interrater Agreement With Human Reviewers
Christopher James Rose, Julia Bidonde, Martin Ringsten, Julie Glanville, Thomas Potrebny, Chris Cooper, Ashley Elizabeth Muller, Hans Bugge Bergsund, Jose F. Meneses-Echavez, Rigmor C. Berg
DOI: 10.1002/cesm.70048
Journal: Cochrane Evidence Synthesis and Methods, Volume 3, Issue 5
Publication date: 2025-09-10 (Journal Article)
Article page: https://onlinelibrary.wiley.com/doi/10.1002/cesm.70048
Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70048
Citations: 0
Abstract
Background
Risk of bias (RoB) assessment is a highly skilled task that is time-consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task-specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non-task-specific Internet-scale training sets. They demonstrate human-like abilities and might be able to support tasks like RoB assessment.
Methods
Following a published peer-reviewed protocol, we randomly sampled 100 Cochrane reviews. New or updated reviews that evaluated medical interventions, included ≥ 1 eligible trial, and presented human consensus assessments using Cochrane RoB1 or RoB2 were eligible. We excluded reviews performed under emergency conditions (e.g., COVID-19), and those on public health or welfare. We randomly sampled one trial from each review. Trials using individual- or cluster-randomized designs were eligible. We extracted human consensus RoB assessments of the trials from the reviews, and methods texts from the trials. We used 25 review-trial pairs to develop a ChatGPT prompt to assess RoB using trial methods text. We used the prompt and the remaining 75 review-trial pairs to estimate human-ChatGPT agreement for “Overall RoB” (primary outcome) and “RoB due to the randomization process”, and ChatGPT-ChatGPT (intrarater) agreement for “Overall RoB”. We used ChatGPT-4o (February 2025) throughout.
Results
The 75 reviews were sampled from 35 Cochrane review groups, and all used RoB1. The 75 trials spanned five decades, and all but one were published in English. Human-ChatGPT agreement for “Overall RoB” assessment was 50.7% (95% CI 39.3%–62.0%), substantially higher than expected by chance (p = 0.0015). Human-ChatGPT agreement for “RoB due to the randomization process” was 78.7% (95% CI 69.4%–88.0%; p < 0.001). ChatGPT-ChatGPT agreement was 74.7% (95% CI 64.8%–84.6%; p < 0.001).
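The agreement figures above are observed proportions over the 75 review-trial pairs. As a rough check, a normal-approximation (Wald) binomial interval, p ± 1.96·√(p(1−p)/n), reproduces the reported 95% CIs to within about 0.1 percentage point; the paper's exact interval method may differ slightly. The agreement counts (38, 59, and 56 of 75) are inferred here from the rounded percentages, not taken from the paper.

```python
import math

def agreement_ci(agreements: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Observed percent agreement with a Wald (normal-approximation) 95% CI."""
    p = agreements / n
    half = z * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * (p - half), 100 * (p + half)

# Agreement counts inferred from the rounded percentages in the abstract (x/75):
for label, k in [("Overall RoB (human-ChatGPT)", 38),
                 ("Randomization process (human-ChatGPT)", 59),
                 ("Overall RoB (ChatGPT-ChatGPT)", 56)]:
    est, lo, hi = agreement_ci(k, 75)
    print(f"{label}: {est:.1f}% (95% CI {lo:.1f}%-{hi:.1f}%)")
```

For example, 38/75 agreements gives 50.7% with a Wald interval of roughly 39.4%–62.0%, essentially the 39.3%–62.0% reported; the small discrepancies in the last digit suggest the authors used a slightly different interval method or rounding.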
Conclusions
ChatGPT appears to have some ability to assess RoB and is unlikely to be guessing or “hallucinating”. The estimated agreement for “Overall RoB” is well above estimates of agreement reported for some human reviewers, but below the highest estimates. LLM-based systems for assessing RoB may be able to help streamline and improve evidence synthesis production.