Assessing the quality of prediction models in health care using the Prediction model Risk Of Bias ASsessment Tool (PROBAST): an evaluation of its use and practical application
Tabea Kaul, Johanna A.A. Damen, Laure Wynants, Ben Van Calster, Maarten van Smeden, Lotty Hooft, Karel G.M. Moons
Journal of Clinical Epidemiology, Volume 181, Article 111732
DOI: 10.1016/j.jclinepi.2025.111732
Published: 2025-02-25 | https://www.sciencedirect.com/science/article/pii/S0895435625000654
Citations: 0
Abstract
Background and Objectives
Since 2019, the Prediction model Risk Of Bias ASsessment Tool (PROBAST; www.probast.org) has supported methodological quality assessments of prediction model studies. Most prediction model studies are rated with a “High” risk of bias (ROB) and researchers report low interrater reliability (IRR) using PROBAST. We aimed to (1) assess the IRR of PROBAST ratings between assessors of the same study and understand reasons for discrepancies, (2) determine which items contribute most to domain-level ROB ratings, and (3) explore the impact of consensus meetings.
Study Design and Setting
We used PROBAST assessments from a systematic review of diagnostic and prognostic COVID-19 prediction models as a case study. Assessors included international experts in prediction model studies or their reviews. We assessed IRR using prevalence-adjusted bias-adjusted kappa (PABAK) before consensus meetings, examined which item-level ratings underlay domain-level ROB judgments, and evaluated the impact of consensus meetings by identifying rating changes after discussion.
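PABAK adjusts the observed agreement between two raters for chance under equiprobable categories. For k rating categories it is defined as (k · p_o − 1)/(k − 1), where p_o is the proportion of observed agreement. A minimal sketch in Python, assuming the three PROBAST ROB categories (Low, High, Unclear, so k = 3); the function name and data layout are illustrative, not from the paper:

```python
def pabak(ratings_a, ratings_b, n_categories=3):
    """Prevalence-adjusted bias-adjusted kappa for two raters.

    ratings_a, ratings_b: parallel lists of categorical ratings
    n_categories: number of possible categories (3 for PROBAST ROB:
    Low / High / Unclear).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty lists of equal length")
    # Observed proportion of agreement between the two raters
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)
    # PABAK = (k * p_o - 1) / (k - 1)
    return (n_categories * p_o - 1) / (n_categories - 1)
```

With perfect agreement PABAK is 1; with three categories, agreement on 3 of 4 ratings (p_o = 0.75) gives (3 · 0.75 − 1)/2 = 0.625.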
Results
We analyzed 2167 PROBAST assessments from 27 assessor pairs covering 760 prediction models: 384 developments, 242 validations, and 134 mixed assessments (including both). The IRR using PABAK was higher for overall ROB judgments (development: 0.82 [0.76; 0.89]; validation: 0.78 [0.68; 0.88]) than for domain- and item-level judgments. Some PROBAST items frequently contributed to domain-level ROB judgments, eg, 3.5 Outcome blinding and 4.1 Sample size. Consensus discussions mainly led to item-level rating changes and never to changes in overall ROB ratings.
Conclusion
Within this case study, PROBAST assessments showed high IRR at the overall ROB level, with some variation at the item and domain levels. To reduce variability, PROBAST assessors should standardize item- and domain-level judgments and hold well-structured consensus meetings between assessors of the same study.
Plain Language Summary
The Prediction model Risk Of Bias ASsessment Tool (PROBAST; www.probast.org) provides a set of items to assess the quality of medical studies on so-called prediction tools that calculate an individual's probability of having or developing a certain disease or health outcome. Previous research found low interrater reliability (IRR; ie, how consistently two assessors rate aspects of the same study) when using PROBAST. To understand why this is the case, we conducted a large study involving more than 30 experts from around the world, all of whom applied PROBAST to the same set of prediction tool studies. Based on more than 2150 PROBAST assessments, we identified which PROBAST items led to the most disagreements between raters, explored reasons for these disagreements, and examined whether the use of so-called consensus meetings (ie, different assessors of the same study discuss their ratings and decide on a finalized rating) impacted PROBAST ratings. Our study found that the IRR between different assessors of the same study was higher than previously reported. One explanation for the better agreement compared to previous research may be the preplanning on how to assess certain PROBAST aspects before starting the assessments, as well as holding well-structured consensus meetings. These improvements lead to a more effective use of PROBAST in evaluating the trustworthiness and quality of prediction tools in the health-care domain.
Journal Description:
The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.