Assessing the quality of prediction models in health care using the Prediction model Risk Of Bias ASsessment Tool (PROBAST): an evaluation of its use and practical application

Impact factor 7.3; Zone 2, Medicine; JCR Q1, Health Care Sciences & Services
Tabea Kaul, Johanna A.A. Damen, Laure Wynants, Ben Van Calster, Maarten van Smeden, Lotty Hooft, Karel G.M. Moons
Journal of Clinical Epidemiology, volume 181, Article 111732. Published 2025-02-25. DOI: 10.1016/j.jclinepi.2025.111732
https://www.sciencedirect.com/science/article/pii/S0895435625000654
Citations: 0

Abstract

Background and Objectives

Since 2019, the Prediction model Risk Of Bias ASsessment Tool (PROBAST; www.probast.org) has supported methodological quality assessments of prediction model studies. Most prediction model studies are rated with a “High” risk of bias (ROB) and researchers report low interrater reliability (IRR) using PROBAST. We aimed to (1) assess the IRR of PROBAST ratings between assessors of the same study and understand reasons for discrepancies, (2) determine which items contribute most to domain-level ROB ratings, and (3) explore the impact of consensus meetings.

Study Design and Setting

We used PROBAST assessments from a systematic review of diagnostic and prognostic COVID-19 prediction models as a case study. Assessors included international experts in prediction model studies or their reviews. We assessed IRR using prevalence-adjusted bias-adjusted kappa (PABAK) before consensus meetings, examined bias ratings per domain-level ROB judgments, and evaluated the impact of consensus meetings by identifying rating changes after discussion.
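
As a rough illustration of the agreement statistic named above, the following sketch computes PABAK for two raters from the observed proportion of agreement, using the common generalization PABAK = (k * Po - 1) / (k - 1) for k rating categories, which reduces to 2 * Po - 1 for two categories. The function name, the category labels, and the example ratings are hypothetical and not taken from the study; the abstract does not describe the authors' exact computation or how the reported intervals were derived.

```python
from typing import Sequence


def pabak(rater_a: Sequence[str], rater_b: Sequence[str], n_categories: int) -> float:
    """Prevalence-adjusted bias-adjusted kappa for two raters.

    Uses PABAK = (k * Po - 1) / (k - 1), where Po is the observed
    proportion of agreement and k is the number of rating categories.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same set of items")
    if len(rater_a) == 0:
        raise ValueError("at least one rated item is required")
    if n_categories < 2:
        raise ValueError("need at least two rating categories")

    # Observed agreement: share of items on which both raters gave the same rating.
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    p_observed = agreements / len(rater_a)
    return (n_categories * p_observed - 1) / (n_categories - 1)


# Hypothetical ratings (not study data): two assessors, ten models,
# three possible ROB judgments ("high", "low", "unclear").
assessor_1 = ["high"] * 8 + ["low", "unclear"]
assessor_2 = ["high"] * 8 + ["low", "high"]
score = pabak(assessor_1, assessor_2, n_categories=3)
print(round(score, 2))
```

Because PABAK depends only on observed agreement and the number of categories, it is not pulled down when one rating (here "high") dominates, which is relevant when most studies are rated high ROB.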

Results

We analyzed 2167 PROBAST assessments from 27 assessor pairs covering 760 prediction models: 384 developments, 242 validations, and 134 mixed assessments (including both). The IRR using PABAK was higher for overall ROB judgments (development: 0.82 [0.76; 0.89]; validation: 0.78 [0.68; 0.88]) than for domain- and item-level judgments. Some PROBAST items frequently contributed to domain-level ROB judgments, eg, 3.5 Outcome blinding and 4.1 Sample size. Consensus discussions mainly led to item-level rating changes and never to changes in the overall ROB rating.
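
The abstract does not explain why consensus changes stayed below the overall level, but the way PROBAST aggregates judgments makes such a pattern plausible: overall ROB is generally rated high when any of the four domains is high, low when all are low, and unclear otherwise. The sketch below shows only that simplified rule and omits further considerations in the PROBAST guidance; the function name, domain keys, and ratings are hypothetical and are not study data.

```python
from typing import Mapping

# The four PROBAST domains; the key spellings are chosen for this example.
DOMAINS = ("participants", "predictors", "outcome", "analysis")
VALID_RATINGS = {"low", "high", "unclear"}


def overall_rob(domain_ratings: Mapping[str, str]) -> str:
    """Simplified overall ROB: any high -> high, all low -> low, else unclear."""
    ratings = [domain_ratings[d] for d in DOMAINS]
    if any(r not in VALID_RATINGS for r in ratings):
        raise ValueError("ratings must be 'low', 'high', or 'unclear'")
    if "high" in ratings:
        return "high"
    if all(r == "low" for r in ratings):
        return "low"
    return "unclear"


# Made-up example: a consensus meeting changes one domain rating,
# but another domain is already high, so the overall rating is unchanged.
before = {"participants": "low", "predictors": "low",
          "outcome": "unclear", "analysis": "high"}
after = {**before, "outcome": "high"}
print(overall_rob(before), overall_rob(after))
```

Under this rule a single high-rated domain fixes the overall rating, so disagreements or revisions elsewhere in the same assessment cannot move it.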

Conclusion

Within this case study, PROBAST assessments showed high IRR at the overall ROB level, with some variation at the item and domain levels. To reduce variability, PROBAST assessors should standardize item- and domain-level judgments and hold well-structured consensus meetings between assessors of the same study.

Plain Language Summary

The Prediction model Risk Of Bias ASsessment Tool (PROBAST; www.probast.org) provides a set of items to assess the quality of medical studies on so-called prediction tools that calculate an individual's probability of having or developing a certain disease or health outcome. Previous research found low interrater reliability (IRR; ie, how consistently two assessors rate aspects of the same study) when using PROBAST. To understand why this is the case, we conducted a large study involving more than 30 experts from around the world, all of whom applied PROBAST to the same set of prediction tool studies. Based on more than 2150 PROBAST assessments, we identified which PROBAST items led to the most disagreements between raters, explored reasons for these disagreements, and examined whether the use of so-called consensus meetings (ie, different assessors of the same study discuss their ratings and decide on a finalized rating) affected PROBAST ratings. Our study found that the IRR between different assessors of the same study was higher than previously reported. One explanation for the better agreement compared to previous research may be that assessors planned in advance how to assess certain PROBAST aspects and held well-structured consensus meetings. These improvements lead to a more effective use of PROBAST in evaluating the trustworthiness and quality of prediction tools in the health-care domain.
Source journal
Journal of Clinical Epidemiology (Medicine: Public, Environmental & Occupational Health)
CiteScore: 12.00
Self-citation rate: 6.90%
Articles published: 320
Review time: 44 days
About the journal: The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.