{"title":"Variability and Advancements in ChatGPT Risk of Bias Assessments: A Replication and Comparative Analysis","authors":"Jules Descamps, Matthieu Resche-Rigon, Guillaume Draznieks, Cesar Quirino, Rémy Nizard, Pierre-Alban Bouché","doi":"10.1111/jebm.70046","DOIUrl":null,"url":null,"abstract":"<p>Dear Editor,</p><p>We read with great interest the letter by Kuitunen et al., in which they evaluate the performance of ChatGPT-4o in conducting risk of bias (RoB) assessments using the Cochrane RoB2 tool. Their study sampled 100 randomized controlled trials (RCTs) from recent meta-analyses published in top-tier medical journals, prompting ChatGPT-4o to provide an overall rating (“low,” “some concerns,” or “high”) and domain-specific ratings for each study. The authors found that the interrater agreement was generally slight to poor, aligning with previous smaller scale observations and highlighting that ChatGPT-4o's default outputs may be overly optimistic when determining bias levels.</p><p>We commend the authors for their systematic approach. Their standardized prompt and focus on RoB2 across a larger sample of RCTs strengthen the validity of their findings. Furthermore, their interrater reliability analyses revealed low correlation coefficients, which underscores the challenges inherent in automating such nuanced evaluations. These results add valuable quantitative data to an area where robust evidence is still emerging.</p><p>Nevertheless, we wish to highlight certain methodological limitations that warrant further consideration. First, the inclusion of five duplicate articles [<span>1-5</span>]—each cited in two different meta-analyses—introduced a situation in which identical articles had identical “ground truth” RoB2 assessments yet received different evaluations by ChatGPT-4o, illustrating variability in large language model (LLM) responses. For instance, for this article [<span>4</span>], D5 was either low and some concerns. Second, the rule-based determination of the overall RoB (low/some concerns/high) from domain-specific ratings itself is algorithmic and does not necessarily require a generative LLM for completion, suggesting that a simpler automated “classification” method might suffice for this aspect. Third, as Kuitunen et al. acknowledge, relying on a single LLM extraction to generate these assessments may be inherently limited, particularly if the model's output can shift based on small prompt changes or session variability.</p><p>To address these issues, we replicated the methodology in the same data set of 100 RCTs using both the original ChatGPT-4o and its updated iteration, 4o-new. Additionally, we employed a sophisticated framework (4o-fram) that integrates 4o as an input for processing .jsonl files associated with full-text articles. The 4o-fram framework (DAM Assess Version 1.25.01; DeepDocs LLC) utilizes a systematic multi-step approach; it applies a predefined evaluation grid to OCR-converted full-text PDFs using a language model, then structures the results into a clean, analyzable Excel table (Figure S1). We also tested OpenAI's newer “o1” model using exactly the same prompts. We replicated the analysis using our own dataset and models, comparing the original ChatGPT-4o results reported by Kuitunen et al. to new outputs from the same model (4o-new) and the updated OpenAI model (o1). 
Although we did not compute weighted Fleiss’ kappa due to reproducibility challenges, our plain Fleiss’ kappa and proportion agreement tables provide a direct comparison of performance across different models and domains. The variability between the original 4o and our 4o-new outputs likely reflects intrinsic fluctuations in LLM responses (Figure 1). For instance, in Domain 1 (D1), the original ChatGPT-4o showed moderate agreement with a Fleiss’ kappa of 0.31 (95% CI 0.25–0.36), whereas our new iteration (4o-new) dropped to −0.05 (95% CI −0.16 to 0.05). The 4o-fram framework, which utilizes ChatGPT-4o with .jsonl files linked to full-text articles, showed notable improvements in agreement across most domains. For instance, in D1, it achieved a Fleiss’ kappa of 0.37 (95% CI 0.29–0.46), substantially higher than the 4o-new iteration and the original 4o outputs. This demonstrates the potential of leveraging full-text data and structured input frameworks to enhance LLM performance in systematic assessments. In contrast, the updated o1 model improved agreement with a kappa of 0.11 (95% CI 0.03–0.19) (Table 1).</p><p>The proportion agreement varied widely across models and domains. For example, although the original ChatGPT-4o achieved high agreement in certain domains (e.g., 80% in D1 and 85% in D4), the 4o-new iteration showed lower agreement values (42% in D1 and 39% in D4). However, the 4o-fram framework outperformed both, achieving agreement rates of 73% in D1 and 82% in D4, showcasing its robustness when using structured .jsonl inputs and full-text data. Notably, 4o-fram demonstrated consistently superior results compared to o1, which showed variable agreement rates (e.g., 65% in D1 and 76% in D4) (Table 2).</p><p>These differences highlight both the sensitivity of model outputs to inherent variability and the potential for improvement with newer models, but even more so through better-designed frameworks like 4o-fram. We also would like to balance the results, in a recent review from BMJ, RoB judgments of RCTs included in more than one Cochrane Review differed substantially. The proportion agreement from humans ranged from 57% to 81% [<span>6</span>], which could moderate the conclusion.</p><p>Several limitations should be acknowledged in our study. First, we did not compute weighted Fleiss’ kappa statistics due to reproducibility challenges, which may limit direct comparability with the original study's metrics. Second, our analysis was constrained to the same 100 RCTs used in the original study, potentially limiting the generalizability of our findings to broader systematic review contexts. Third, the inherent variability observed between different model iterations (4o vs. 4o-new) highlights the challenge of reproducibility in LLM-based assessments, which remains a significant concern for systematic implementation. Fourth, although the 4o-fram framework showed promising improvements, it requires access to full-text articles and specialized processing infrastructure. Finally, our evaluation focused primarily on agreement metrics rather than exploring the underlying reasons for disagreements, which could provide valuable insights for future framework development.</p><p>In conclusion, we thank the authors for contributing valuable data on the performance of ChatGPT-4o in RoB assessments. 
Their findings—and our own subsequent analyses—reveal that while LLM-assisted RoB evaluation continues to face significant limitations [<span>7</span>], the development of structured frameworks can significantly enhance reliability and precision. We believe that further comparative studies, alongside improvements in both model architectures and protocols (e.g., systematic prompts, consensus approaches, and advanced frameworks), will be essential to determining how LLMs can be most effectively and responsibly deployed in systematic reviews and meta-analyses.</p><p>The authors declare no conflicts of interest.</p>","PeriodicalId":16090,"journal":{"name":"Journal of Evidence‐Based Medicine","volume":"18 2","pages":""},"PeriodicalIF":3.6000,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jebm.70046","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Evidence‐Based Medicine","FirstCategoryId":"3","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jebm.70046","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Dear Editor,
We read with great interest the letter by Kuitunen et al., in which they evaluate the performance of ChatGPT-4o in conducting risk of bias (RoB) assessments using the Cochrane RoB2 tool. Their study sampled 100 randomized controlled trials (RCTs) from recent meta-analyses published in top-tier medical journals, prompting ChatGPT-4o to provide an overall rating (“low,” “some concerns,” or “high”) and domain-specific ratings for each study. The authors found that interrater agreement was generally slight to poor, aligning with previous smaller-scale observations and highlighting that ChatGPT-4o's default outputs may be overly optimistic when determining bias levels.
We commend the authors for their systematic approach. Their standardized prompt and focus on RoB2 across a larger sample of RCTs strengthen the validity of their findings. Furthermore, their interrater reliability analyses yielded low agreement coefficients, underscoring the challenges inherent in automating such nuanced evaluations. These results add valuable quantitative data to an area where robust evidence is still emerging.
Nevertheless, we wish to highlight certain methodological limitations that warrant further consideration. First, the inclusion of five duplicate articles [1-5], each cited in two different meta-analyses, created a situation in which identical articles had identical “ground truth” RoB2 assessments yet received different evaluations from ChatGPT-4o, illustrating variability in large language model (LLM) responses. For instance, for one of these articles [4], Domain 5 (D5) was rated “low” in one assessment and “some concerns” in the other. Second, the rule-based determination of the overall RoB (low/some concerns/high) from the domain-specific ratings is itself algorithmic and does not necessarily require a generative LLM, suggesting that a simpler automated classification step might suffice for this aspect. Third, as Kuitunen et al. acknowledge, relying on a single LLM extraction to generate these assessments may be inherently limited, particularly if the model's output can shift with small prompt changes or session variability.
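To illustrate the second point, a minimal sketch of such a deterministic aggregation is shown below (Python). It is not part of either study's pipeline; the cut-off of three “some concerns” domains is an assumption standing in for RoB2's judgement-based wording about multiple domains that substantially lower confidence in the result.

# Minimal, illustrative aggregation of RoB2 domain ratings into an overall
# judgement. This is not the ChatGPT-4o pipeline; it only shows that the final
# step is deterministic. The ">= 3 some-concerns domains" cut-off is an
# assumption standing in for RoB2's judgement-based wording ("some concerns for
# multiple domains in a way that substantially lowers confidence in the result").
from typing import Iterable

LOW, SOME_CONCERNS, HIGH = "low", "some concerns", "high"

def overall_rob2(domain_ratings: Iterable[str], multiple_threshold: int = 3) -> str:
    ratings = [r.strip().lower() for r in domain_ratings]
    if HIGH in ratings:
        return HIGH
    n_concerns = ratings.count(SOME_CONCERNS)
    if n_concerns >= multiple_threshold:
        return HIGH
    if n_concerns >= 1:
        return SOME_CONCERNS
    return LOW

# Example: domain ratings D1-D5 for a single RCT.
print(overall_rob2(["low", "some concerns", "low", "low", "some concerns"]))
# -> "some concerns"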
To address these issues, we replicated the methodology on the same data set of 100 RCTs using both the original ChatGPT-4o and its updated iteration, 4o-new. Additionally, we employed a structured framework (4o-fram) that uses ChatGPT-4o to process .jsonl files derived from full-text articles. The 4o-fram framework (DAM Assess Version 1.25.01; DeepDocs LLC) follows a systematic multi-step approach: it applies a predefined evaluation grid to OCR-converted full-text PDFs using a language model, then structures the results into a clean, analyzable Excel table (Figure S1). We also tested OpenAI's newer “o1” model using exactly the same prompts. In summary, we compared the original ChatGPT-4o results reported by Kuitunen et al. with new outputs from the same model (4o-new), the 4o-fram framework, and the updated OpenAI model (o1). Although we did not compute weighted Fleiss’ kappa because of reproducibility challenges, our plain Fleiss’ kappa and proportion agreement tables provide a direct comparison of performance across models and domains. The variability between the original 4o and our 4o-new outputs likely reflects intrinsic fluctuations in LLM responses (Figure 1). For instance, in Domain 1 (D1), the original ChatGPT-4o showed fair agreement, with a Fleiss’ kappa of 0.31 (95% CI 0.25–0.36), whereas our new run (4o-new) dropped to −0.05 (95% CI −0.16 to 0.05). The 4o-fram framework, which supplies ChatGPT-4o with full-text articles via .jsonl files, showed notable improvements in agreement across most domains: in D1 it achieved a Fleiss’ kappa of 0.37 (95% CI 0.29–0.46), substantially higher than both the 4o-new run and the original 4o outputs. This demonstrates the potential of leveraging full-text data and structured input frameworks to enhance LLM performance in systematic assessments. The updated o1 model improved on 4o-new, with a kappa of 0.11 (95% CI 0.03–0.19), but remained well below 4o-fram (Table 1).
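To make concrete what such a framework involves, the sketch below outlines a 4o-fram-style flow under stated assumptions: it reads OCR-converted full texts from a .jsonl file, asks a language model to fill a predefined RoB2 evaluation grid, and writes the structured answers to an Excel table. This is not the DAM Assess implementation; the file name, the record fields ("pmid", "full_text"), and the grid wording are hypothetical, and the snippet uses the standard OpenAI Python client and pandas.

# Illustrative sketch of a 4o-fram-style pipeline: read OCR'd full texts from a
# .jsonl file, ask a language model to fill a predefined RoB2 evaluation grid,
# and write the structured answers to an Excel table. Field names ("pmid",
# "full_text"), the file name, and the grid prompt are assumptions, not the
# DAM Assess schema.
import json

import pandas as pd
from openai import OpenAI  # standard OpenAI Python client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRID_PROMPT = (
    "Using the Cochrane RoB2 tool, rate domains D1-D5 and the overall risk of "
    "bias for the trial below as 'low', 'some concerns', or 'high'. "
    "Answer strictly as a JSON object with keys D1, D2, D3, D4, D5, overall.\n\n"
)

def assess_record(record: dict) -> dict:
    """Apply the evaluation grid to one OCR-converted full text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduces, but does not eliminate, run-to-run variability
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": GRID_PROMPT + record["full_text"]}],
    )
    ratings = json.loads(response.choices[0].message.content)
    return {"id": record["pmid"], **ratings}

rows = []
with open("rcts_fulltext.jsonl", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        rows.append(assess_record(json.loads(line)))

pd.DataFrame(rows).to_excel("rob2_assessments.xlsx", index=False)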
The proportion agreement varied widely across models and domains. For example, although the original ChatGPT-4o achieved high agreement in certain domains (e.g., 80% in D1 and 85% in D4), the 4o-new iteration showed lower agreement values (42% in D1 and 39% in D4). However, the 4o-fram framework outperformed both, achieving agreement rates of 73% in D1 and 82% in D4, showcasing its robustness when using structured .jsonl inputs and full-text data. Notably, 4o-fram demonstrated consistently superior results compared to o1, which showed variable agreement rates (e.g., 65% in D1 and 76% in D4) (Table 2).
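For readers who wish to reproduce the agreement metrics, the snippet below shows one way to compute plain (unweighted) Fleiss’ kappa and raw proportion agreement between a model's ratings and the reference assessments for a given domain, treating the model and the reference as two raters. It relies on statsmodels, and the example ratings are placeholders rather than study data.

# One way to compute the two metrics reported in Tables 1 and 2: plain
# (unweighted) Fleiss' kappa and raw proportion agreement between a model's
# domain ratings and the reference ("ground truth") ratings. The example
# ratings are placeholders, not the study data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

CATEGORIES = ["low", "some concerns", "high"]

def agreement_metrics(model_ratings, reference_ratings):
    model = np.asarray(model_ratings)
    reference = np.asarray(reference_ratings)

    # Proportion agreement: share of trials where the two ratings match.
    proportion = float(np.mean(model == reference))

    # Fleiss' kappa: treat the model and the reference as two raters and build
    # the (n_subjects x n_categories) count table that statsmodels expects.
    codes = np.column_stack([
        [CATEGORIES.index(r) for r in model],
        [CATEGORIES.index(r) for r in reference],
    ])
    table, _ = aggregate_raters(codes, n_cat=len(CATEGORIES))
    kappa = fleiss_kappa(table, method="fleiss")

    return kappa, proportion

# Placeholder D1 ratings for five trials (illustration only).
model_d1 = ["low", "low", "some concerns", "high", "low"]
reference_d1 = ["low", "some concerns", "some concerns", "high", "low"]
print(agreement_metrics(model_d1, reference_d1))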
The differences described above highlight the sensitivity of model outputs to inherent variability, as well as the potential for improvement with newer models and, even more so, with better-designed frameworks such as 4o-fram. We would also like to put these results in perspective: in a recent review published in the BMJ, RoB judgments of RCTs included in more than one Cochrane Review differed substantially, with proportion agreement between human reviewers ranging from 57% to 81% [6], which could moderate the conclusions drawn from such comparisons.
Several limitations should be acknowledged in our study. First, we did not compute weighted Fleiss’ kappa statistics due to reproducibility challenges, which may limit direct comparability with the original study's metrics. Second, our analysis was constrained to the same 100 RCTs used in the original study, potentially limiting the generalizability of our findings to broader systematic review contexts. Third, the inherent variability observed between different model iterations (4o vs. 4o-new) highlights the challenge of reproducibility in LLM-based assessments, which remains a significant concern for systematic implementation. Fourth, although the 4o-fram framework showed promising improvements, it requires access to full-text articles and specialized processing infrastructure. Finally, our evaluation focused primarily on agreement metrics rather than exploring the underlying reasons for disagreements, which could provide valuable insights for future framework development.
In conclusion, we thank the authors for contributing valuable data on the performance of ChatGPT-4o in RoB assessments. Their findings, together with our own subsequent analyses, show that while LLM-assisted RoB evaluation continues to face significant limitations [7], structured frameworks can substantially enhance reliability and precision. We believe that further comparative studies, alongside improvements in both model architectures and protocols (e.g., systematic prompts, consensus approaches, and advanced frameworks), will be essential to determining how LLMs can be most effectively and responsibly deployed in systematic reviews and meta-analyses.

The authors declare no conflicts of interest.