评估基于树的集成机器学习技术用于临床风险预测的样本量要求。

IF 1.9 3区医学 Q3 HEALTH CARE SCIENCES & SERVICES

Statistical Methods in Medical Research Pub Date : 2025-07-01 Epub Date: 2025-05-14 DOI:10.1177/09622802251338983

Oya Kalaycıoğlu, Menelaos Pavlou, Serhat E Akhanlı, Mark A de Belder, Gareth Ambler, Rumana Z Omar

{"title":"评估基于树的集成机器学习技术用于临床风险预测的样本量要求。","authors":"Oya Kalaycıoğlu, Menelaos Pavlou, Serhat E Akhanlı, Mark A de Belder, Gareth Ambler, Rumana Z Omar","doi":"10.1177/09622802251338983","DOIUrl":null,"url":null,"abstract":"Machine learning techniques (MLTs) are increasingly being used to develop clinical risk prediction models for binary health outcomes but the sample size requirements for developing and validating such models remain unclear. This study investigates whether sample size guidelines that target mean absolute prediction error (MAPE) for logistic regression models can be applied to tree-based ensemble MLTs (bagging, random forests, and boosting). Simulations based on two large cardiovascular datasets were used to evaluate the performance of MLTs in terms of MAPE, calibration, the C-statistic and Brier score, across six data-generating mechanisms (DGMs) and varying sample sizes. When the DGM and analysis model matched, boosting required a sample size 2-3 times larger than recommended; random forests and bagging did not achieve the target MAPE even with a 12-fold increase. For a neutral DGM that did not match any of the analysis models, logistic regression with only main effects and boosting resulted in target MAPE values with a 12-fold increase in the recommended sample size. For external validation, our simulations showed that sample size guidelines to achieve a target precision of the estimated C-statistic were suitable, and thus may be used to inform sample size calculations for MLTs.","PeriodicalId":22038,"journal":{"name":"Statistical Methods in Medical Research","volume":" ","pages":"1356-1372"},"PeriodicalIF":1.9000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12308042/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating the sample size requirements of tree-based ensemble machine learning techniques for clinical risk prediction.\",\"authors\":\"Oya Kalaycıoğlu, Menelaos Pavlou, Serhat E Akhanlı, Mark A de Belder, Gareth Ambler, Rumana Z Omar\",\"doi\":\"10.1177/09622802251338983\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Machine learning techniques (MLTs) are increasingly being used to develop clinical risk prediction models for binary health outcomes but the sample size requirements for developing and validating such models remain unclear. This study investigates whether sample size guidelines that target mean absolute prediction error (MAPE) for logistic regression models can be applied to tree-based ensemble MLTs (bagging, random forests, and boosting). Simulations based on two large cardiovascular datasets were used to evaluate the performance of MLTs in terms of MAPE, calibration, the C-statistic and Brier score, across six data-generating mechanisms (DGMs) and varying sample sizes. When the DGM and analysis model matched, boosting required a sample size 2-3 times larger than recommended; random forests and bagging did not achieve the target MAPE even with a 12-fold increase. For a neutral DGM that did not match any of the analysis models, logistic regression with only main effects and boosting resulted in target MAPE values with a 12-fold increase in the recommended sample size. For external validation, our simulations showed that sample size guidelines to achieve a target precision of the estimated C-statistic were suitable, and thus may be used to inform sample size calculations for MLTs.\",\"PeriodicalId\":22038,\"journal\":{\"name\":\"Statistical Methods in Medical Research\",\"volume\":\" \",\"pages\":\"1356-1372\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2025-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12308042/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Statistical Methods in Medical Research\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/09622802251338983\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/5/14 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Methods in Medical Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/09622802251338983","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/5/14 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}

引用次数: 0

摘要

机器学习技术（mlt）越来越多地被用于开发二元健康结果的临床风险预测模型，但开发和验证此类模型的样本量要求仍不清楚。本研究探讨了以逻辑回归模型的平均绝对预测误差（MAPE）为目标的样本量指南是否可以应用于基于树的集成mlt（套袋、随机森林和提升）。基于两个大型心血管数据集的模拟用于评估mlt在MAPE、校准、c统计量和Brier评分方面的性能，跨越六种数据生成机制（dgm）和不同的样本量。当DGM和分析模型匹配时，提升所需的样本量比推荐的大2-3倍；随机森林和套袋即使增加了12倍，也没有达到目标MAPE。对于与任何分析模型都不匹配的中性DGM，只有主效应和促进的逻辑回归导致目标MAPE值增加了12倍的推荐样本量。对于外部验证，我们的模拟表明，实现估计c统计量的目标精度的样本量指南是合适的，因此可以用于通知mlt的样本量计算。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Evaluating the sample size requirements of tree-based ensemble machine learning techniques for clinical risk prediction.

查看原文本刊更多论文

Evaluating the sample size requirements of tree-based ensemble machine learning techniques for clinical risk prediction.

Machine learning techniques (MLTs) are increasingly being used to develop clinical risk prediction models for binary health outcomes but the sample size requirements for developing and validating such models remain unclear. This study investigates whether sample size guidelines that target mean absolute prediction error (MAPE) for logistic regression models can be applied to tree-based ensemble MLTs (bagging, random forests, and boosting). Simulations based on two large cardiovascular datasets were used to evaluate the performance of MLTs in terms of MAPE, calibration, the C-statistic and Brier score, across six data-generating mechanisms (DGMs) and varying sample sizes. When the DGM and analysis model matched, boosting required a sample size 2-3 times larger than recommended; random forests and bagging did not achieve the target MAPE even with a 12-fold increase. For a neutral DGM that did not match any of the analysis models, logistic regression with only main effects and boosting resulted in target MAPE values with a 12-fold increase in the recommended sample size. For external validation, our simulations showed that sample size guidelines to achieve a target precision of the estimated C-statistic were suitable, and thus may be used to inform sample size calculations for MLTs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Statistical Methods in Medical Research 医学-数学与计算生物学

CiteScore

4.10

自引率

4.30%

发文量

127

审稿时长

>12 weeks

期刊介绍： Statistical Methods in Medical Research is a peer reviewed scholarly journal and is the leading vehicle for articles in all the main areas of medical statistics and an essential reference for all medical statisticians. This unique journal is devoted solely to statistics and medicine and aims to keep professionals abreast of the many powerful statistical techniques now available to the medical profession. This journal is a member of the Committee on Publication Ethics (COPE)