{"title":"Variability and Advancements in ChatGPT Risk of Bias Assessments: A Replication and Comparative Analysis","authors":"Jules Descamps, Matthieu Resche-Rigon, Guillaume Draznieks, Cesar Quirino, Rémy Nizard, Pierre-Alban Bouché","doi":"10.1111/jebm.70046","DOIUrl":null,"url":null,"abstract":"<p>Dear Editor,</p><p>We read with great interest the letter by Kuitunen et al., in which they evaluate the performance of ChatGPT-4o in conducting risk of bias (RoB) assessments using the Cochrane RoB2 tool. Their study sampled 100 randomized controlled trials (RCTs) from recent meta-analyses published in top-tier medical journals, prompting ChatGPT-4o to provide an overall rating (“low,” “some concerns,” or “high”) and domain-specific ratings for each study. The authors found that the interrater agreement was generally slight to poor, aligning with previous smaller scale observations and highlighting that ChatGPT-4o's default outputs may be overly optimistic when determining bias levels.</p><p>We commend the authors for their systematic approach. Their standardized prompt and focus on RoB2 across a larger sample of RCTs strengthen the validity of their findings. Furthermore, their interrater reliability analyses revealed low correlation coefficients, which underscores the challenges inherent in automating such nuanced evaluations. These results add valuable quantitative data to an area where robust evidence is still emerging.</p><p>Nevertheless, we wish to highlight certain methodological limitations that warrant further consideration. First, the inclusion of five duplicate articles [<span>1-5</span>]—each cited in two different meta-analyses—introduced a situation in which identical articles had identical “ground truth” RoB2 assessments yet received different evaluations by ChatGPT-4o, illustrating variability in large language model (LLM) responses. For instance, for this article [<span>4</span>], D5 was either low and some concerns. Second, the rule-based determination of the overall RoB (low/some concerns/high) from domain-specific ratings itself is algorithmic and does not necessarily require a generative LLM for completion, suggesting that a simpler automated “classification” method might suffice for this aspect. Third, as Kuitunen et al. acknowledge, relying on a single LLM extraction to generate these assessments may be inherently limited, particularly if the model's output can shift based on small prompt changes or session variability.</p><p>To address these issues, we replicated the methodology in the same data set of 100 RCTs using both the original ChatGPT-4o and its updated iteration, 4o-new. Additionally, we employed a sophisticated framework (4o-fram) that integrates 4o as an input for processing .jsonl files associated with full-text articles. The 4o-fram framework (DAM Assess Version 1.25.01; DeepDocs LLC) utilizes a systematic multi-step approach; it applies a predefined evaluation grid to OCR-converted full-text PDFs using a language model, then structures the results into a clean, analyzable Excel table (Figure S1). We also tested OpenAI's newer “o1” model using exactly the same prompts. We replicated the analysis using our own dataset and models, comparing the original ChatGPT-4o results reported by Kuitunen et al. to new outputs from the same model (4o-new) and the updated OpenAI model (o1). 
Although we did not compute weighted Fleiss’ kappa due to reproducibility challenges, our plain Fleiss’ kappa and proportion agreement tables provide a direct comparison of performance across different models and domains. The variability between the original 4o and our 4o-new outputs likely reflects intrinsic fluctuations in LLM responses (Figure 1). For instance, in Domain 1 (D1), the original ChatGPT-4o showed moderate agreement with a Fleiss’ kappa of 0.31 (95% CI 0.25–0.36), whereas our new iteration (4o-new) dropped to −0.05 (95% CI −0.16 to 0.05). The 4o-fram framework, which utilizes ChatGPT-4o with .jsonl files linked to full-text articles, showed notable improvements in agreement across most domains. For instance, in D1, it achieved a Fleiss’ kappa of 0.37 (95% CI 0.29–0.46), substantially higher than the 4o-new iteration and the original 4o outputs. This demonstrates the potential of leveraging full-text data and structured input frameworks to enhance LLM performance in systematic assessments. In contrast, the updated o1 model improved agreement with a kappa of 0.11 (95% CI 0.03–0.19) (Table 1).</p><p>The proportion agreement varied widely across models and domains. For example, although the original ChatGPT-4o achieved high agreement in certain domains (e.g., 80% in D1 and 85% in D4), the 4o-new iteration showed lower agreement values (42% in D1 and 39% in D4). However, the 4o-fram framework outperformed both, achieving agreement rates of 73% in D1 and 82% in D4, showcasing its robustness when using structured .jsonl inputs and full-text data. Notably, 4o-fram demonstrated consistently superior results compared to o1, which showed variable agreement rates (e.g., 65% in D1 and 76% in D4) (Table 2).</p><p>These differences highlight both the sensitivity of model outputs to inherent variability and the potential for improvement with newer models, but even more so through better-designed frameworks like 4o-fram. We also would like to balance the results, in a recent review from BMJ, RoB judgments of RCTs included in more than one Cochrane Review differed substantially. The proportion agreement from humans ranged from 57% to 81% [<span>6</span>], which could moderate the conclusion.</p><p>Several limitations should be acknowledged in our study. First, we did not compute weighted Fleiss’ kappa statistics due to reproducibility challenges, which may limit direct comparability with the original study's metrics. Second, our analysis was constrained to the same 100 RCTs used in the original study, potentially limiting the generalizability of our findings to broader systematic review contexts. Third, the inherent variability observed between different model iterations (4o vs. 4o-new) highlights the challenge of reproducibility in LLM-based assessments, which remains a significant concern for systematic implementation. Fourth, although the 4o-fram framework showed promising improvements, it requires access to full-text articles and specialized processing infrastructure. Finally, our evaluation focused primarily on agreement metrics rather than exploring the underlying reasons for disagreements, which could provide valuable insights for future framework development.</p><p>In conclusion, we thank the authors for contributing valuable data on the performance of ChatGPT-4o in RoB assessments. 
Their findings—and our own subsequent analyses—reveal that while LLM-assisted RoB evaluation continues to face significant limitations [<span>7</span>], the development of structured frameworks can significantly enhance reliability and precision. We believe that further comparative studies, alongside improvements in both model architectures and protocols (e.g., systematic prompts, consensus approaches, and advanced frameworks), will be essential to determining how LLMs can be most effectively and responsibly deployed in systematic reviews and meta-analyses.</p><p>The authors declare no conflicts of interest.</p>","PeriodicalId":16090,"journal":{"name":"Journal of Evidence‐Based Medicine","volume":"18 2","pages":""},"PeriodicalIF":3.6000,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jebm.70046","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Evidence‐Based Medicine","FirstCategoryId":"3","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jebm.70046","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Dear Editor,
We read with great interest the letter by Kuitunen et al., in which they evaluate the performance of ChatGPT-4o in conducting risk of bias (RoB) assessments using the Cochrane RoB2 tool. Their study sampled 100 randomized controlled trials (RCTs) from recent meta-analyses published in top-tier medical journals, prompting ChatGPT-4o to provide an overall rating (“low,” “some concerns,” or “high”) and domain-specific ratings for each study. The authors found that interrater agreement was generally slight to poor, aligning with previous smaller-scale observations and highlighting that ChatGPT-4o's default outputs may be overly optimistic when determining bias levels.
We commend the authors for their systematic approach. Their standardized prompt and focus on RoB2 across a larger sample of RCTs strengthen the validity of their findings. Furthermore, their interrater reliability analyses yielded low agreement coefficients, underscoring the challenges inherent in automating such nuanced evaluations. These results add valuable quantitative data to an area where robust evidence is still emerging.
Nevertheless, we wish to highlight certain methodological limitations that warrant further consideration. First, the inclusion of five duplicate articles [1-5], each cited in two different meta-analyses, created a situation in which identical articles had identical “ground truth” RoB2 assessments yet received different evaluations from ChatGPT-4o, illustrating variability in large language model (LLM) responses. For instance, for one of these articles [4], Domain 5 (D5) was rated “low” in one assessment and “some concerns” in the other. Second, the rule-based determination of the overall RoB (low/some concerns/high) from the domain-specific ratings is itself algorithmic and does not necessarily require a generative LLM, suggesting that a simpler automated classification step might suffice for this aspect. Third, as Kuitunen et al. acknowledge, relying on a single LLM extraction to generate these assessments may be inherently limited, particularly if the model's output can shift with small prompt changes or session variability.
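To illustrate the second point, a minimal sketch of such a deterministic aggregation is shown below (Python). It is not part of either study's pipeline; the cut-off of three “some concerns” domains is an assumption standing in for RoB2's judgement-based wording about multiple domains that substantially lower confidence in the result.

# Minimal, illustrative aggregation of RoB2 domain ratings into an overall
# judgement. This is not the ChatGPT-4o pipeline; it only shows that the final
# step is deterministic. The ">= 3 some-concerns domains" cut-off is an
# assumption standing in for RoB2's judgement-based wording ("some concerns for
# multiple domains in a way that substantially lowers confidence in the result").
from typing import Iterable

LOW, SOME_CONCERNS, HIGH = "low", "some concerns", "high"

def overall_rob2(domain_ratings: Iterable[str], multiple_threshold: int = 3) -> str:
    ratings = [r.strip().lower() for r in domain_ratings]
    if HIGH in ratings:
        return HIGH
    n_concerns = ratings.count(SOME_CONCERNS)
    if n_concerns >= multiple_threshold:
        return HIGH
    if n_concerns >= 1:
        return SOME_CONCERNS
    return LOW

# Example: domain ratings D1-D5 for a single RCT.
print(overall_rob2(["low", "some concerns", "low", "low", "some concerns"]))
# -> "some concerns"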
To address these issues, we replicated the methodology on the same data set of 100 RCTs using both the original ChatGPT-4o and its updated iteration, 4o-new. Additionally, we employed a structured framework (4o-fram) that uses ChatGPT-4o to process .jsonl files derived from full-text articles. The 4o-fram framework (DAM Assess Version 1.25.01; DeepDocs LLC) follows a systematic multi-step approach: it applies a predefined evaluation grid to OCR-converted full-text PDFs using a language model, then structures the results into a clean, analyzable Excel table (Figure S1). We also tested OpenAI's newer “o1” model using exactly the same prompts. In summary, we compared the original ChatGPT-4o results reported by Kuitunen et al. with new outputs from the same model (4o-new), the 4o-fram framework, and the updated OpenAI model (o1). Although we did not compute weighted Fleiss’ kappa because of reproducibility challenges, our plain Fleiss’ kappa and proportion agreement tables provide a direct comparison of performance across models and domains. The variability between the original 4o and our 4o-new outputs likely reflects intrinsic fluctuations in LLM responses (Figure 1). For instance, in Domain 1 (D1), the original ChatGPT-4o showed fair agreement, with a Fleiss’ kappa of 0.31 (95% CI 0.25–0.36), whereas our new run (4o-new) dropped to −0.05 (95% CI −0.16 to 0.05). The 4o-fram framework, which supplies ChatGPT-4o with full-text articles via .jsonl files, showed notable improvements in agreement across most domains: in D1 it achieved a Fleiss’ kappa of 0.37 (95% CI 0.29–0.46), substantially higher than both the 4o-new run and the original 4o outputs. This demonstrates the potential of leveraging full-text data and structured input frameworks to enhance LLM performance in systematic assessments. The updated o1 model improved on 4o-new, with a kappa of 0.11 (95% CI 0.03–0.19), but remained well below 4o-fram (Table 1).
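To make concrete what such a framework involves, the sketch below outlines a 4o-fram-style flow under stated assumptions: it reads OCR-converted full texts from a .jsonl file, asks a language model to fill a predefined RoB2 evaluation grid, and writes the structured answers to an Excel table. This is not the DAM Assess implementation; the file name, the record fields ("pmid", "full_text"), and the grid wording are hypothetical, and the snippet uses the standard OpenAI Python client and pandas.

# Illustrative sketch of a 4o-fram-style pipeline: read OCR'd full texts from a
# .jsonl file, ask a language model to fill a predefined RoB2 evaluation grid,
# and write the structured answers to an Excel table. Field names ("pmid",
# "full_text"), the file name, and the grid prompt are assumptions, not the
# DAM Assess schema.
import json

import pandas as pd
from openai import OpenAI  # standard OpenAI Python client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRID_PROMPT = (
    "Using the Cochrane RoB2 tool, rate domains D1-D5 and the overall risk of "
    "bias for the trial below as 'low', 'some concerns', or 'high'. "
    "Answer strictly as a JSON object with keys D1, D2, D3, D4, D5, overall.\n\n"
)

def assess_record(record: dict) -> dict:
    """Apply the evaluation grid to one OCR-converted full text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduces, but does not eliminate, run-to-run variability
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": GRID_PROMPT + record["full_text"]}],
    )
    ratings = json.loads(response.choices[0].message.content)
    return {"id": record["pmid"], **ratings}

rows = []
with open("rcts_fulltext.jsonl", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        rows.append(assess_record(json.loads(line)))

pd.DataFrame(rows).to_excel("rob2_assessments.xlsx", index=False)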
The proportion agreement varied widely across models and domains. For example, although the original ChatGPT-4o achieved high agreement in certain domains (e.g., 80% in D1 and 85% in D4), the 4o-new iteration showed lower agreement values (42% in D1 and 39% in D4). However, the 4o-fram framework outperformed both, achieving agreement rates of 73% in D1 and 82% in D4, showcasing its robustness when using structured .jsonl inputs and full-text data. Notably, 4o-fram demonstrated consistently superior results compared to o1, which showed variable agreement rates (e.g., 65% in D1 and 76% in D4) (Table 2).
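For readers who wish to reproduce the agreement metrics, the snippet below shows one way to compute plain (unweighted) Fleiss’ kappa and raw proportion agreement between a model's ratings and the reference assessments for a given domain, treating the model and the reference as two raters. It relies on statsmodels, and the example ratings are placeholders rather than study data.

# One way to compute the two metrics reported in Tables 1 and 2: plain
# (unweighted) Fleiss' kappa and raw proportion agreement between a model's
# domain ratings and the reference ("ground truth") ratings. The example
# ratings are placeholders, not the study data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

CATEGORIES = ["low", "some concerns", "high"]

def agreement_metrics(model_ratings, reference_ratings):
    model = np.asarray(model_ratings)
    reference = np.asarray(reference_ratings)

    # Proportion agreement: share of trials where the two ratings match.
    proportion = float(np.mean(model == reference))

    # Fleiss' kappa: treat the model and the reference as two raters and build
    # the (n_subjects x n_categories) count table that statsmodels expects.
    codes = np.column_stack([
        [CATEGORIES.index(r) for r in model],
        [CATEGORIES.index(r) for r in reference],
    ])
    table, _ = aggregate_raters(codes, n_cat=len(CATEGORIES))
    kappa = fleiss_kappa(table, method="fleiss")

    return kappa, proportion

# Placeholder D1 ratings for five trials (illustration only).
model_d1 = ["low", "low", "some concerns", "high", "low"]
reference_d1 = ["low", "some concerns", "some concerns", "high", "low"]
print(agreement_metrics(model_d1, reference_d1))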
The differences described above highlight the sensitivity of model outputs to inherent variability, as well as the potential for improvement with newer models and, even more so, with better-designed frameworks such as 4o-fram. We would also like to put these results in perspective: in a recent review published in the BMJ, RoB judgments of RCTs included in more than one Cochrane Review differed substantially, with proportion agreement between human reviewers ranging from 57% to 81% [6], which could moderate the conclusions drawn from such comparisons.
Several limitations should be acknowledged in our study. First, we did not compute weighted Fleiss’ kappa statistics due to reproducibility challenges, which may limit direct comparability with the original study's metrics. Second, our analysis was constrained to the same 100 RCTs used in the original study, potentially limiting the generalizability of our findings to broader systematic review contexts. Third, the inherent variability observed between different model iterations (4o vs. 4o-new) highlights the challenge of reproducibility in LLM-based assessments, which remains a significant concern for systematic implementation. Fourth, although the 4o-fram framework showed promising improvements, it requires access to full-text articles and specialized processing infrastructure. Finally, our evaluation focused primarily on agreement metrics rather than exploring the underlying reasons for disagreements, which could provide valuable insights for future framework development.
In conclusion, we thank the authors for contributing valuable data on the performance of ChatGPT-4o in RoB assessments. Their findings, together with our own subsequent analyses, show that while LLM-assisted RoB evaluation continues to face significant limitations [7], structured frameworks can substantially enhance reliability and precision. We believe that further comparative studies, alongside improvements in both model architectures and protocols (e.g., systematic prompts, consensus approaches, and advanced frameworks), will be essential to determining how LLMs can be most effectively and responsibly deployed in systematic reviews and meta-analyses.

The authors declare no conflicts of interest.