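Comparison of Random Forest and Stepwise Regression for Variable Selection Using Low Prevalence Predictors: A Case Study in Paediatric Sepsis.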

IF 1.8; Zone 4 (Medicine); JCR Q3; PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH
Patricia Gilholm, Paula Lister, Adam Irwin, Amanda Harley, Sainath Raman, Luregn J Schlapbach, Kristen S Gibbons
DOI: 10.1007/s10995-025-04038-1 (https://doi.org/10.1007/s10995-025-04038-1)
Journal: Maternal and Child Health Journal
Publication date: 2025-01-15
Publication type: Journal Article
Citations: 0

Abstract

Comparison of Random Forest and Stepwise Regression for Variable Selection Using Low Prevalence Predictors: A Case Study in Paediatric Sepsis.

Introduction: Variable selection is a common technique to identify the most predictive variables from a pool of candidate predictors. Low prevalence predictors (LPPs) are frequently found in clinical data, yet few studies have explored their impact on model performance during variable selection. This study compared the Random Forest (RF) algorithm and stepwise regression (SWR) for variable selection using data from a paediatric sepsis screening tool, where 18 out of 32 predictors had a prevalence < 10%.
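To make the prevalence criterion concrete, the following is a minimal Python sketch, not taken from the paper, that flags binary predictors present in fewer than 10% of records. The function name, the predictor names, and the toy records are all invented for illustration; only the 10% threshold comes from the abstract.

```python
from typing import Dict, List


def low_prevalence_predictors(
    rows: List[Dict[str, int]], threshold: float = 0.10
) -> Dict[str, float]:
    """Return {predictor: prevalence} for binary (0/1) predictors
    whose prevalence is strictly below `threshold`."""
    if not rows:
        return {}
    n = len(rows)
    flagged: Dict[str, float] = {}
    for name in rows[0]:
        # Prevalence = share of records in which the predictor is present.
        prevalence = sum(row[name] for row in rows) / n
        if prevalence < threshold:
            flagged[name] = prevalence
    return flagged


# Toy data (hypothetical): 20 screening records with 3 binary predictors.
records = [
    {
        "fever": 1 if i < 12 else 0,     # present in 12/20 records
        "rash": 1 if i < 1 else 0,       # present in 1/20 records
        "lethargy": 1 if i < 2 else 0,   # present in 2/20 records
    }
    for i in range(20)
]

flagged = low_prevalence_predictors(records)
print(flagged)
```

In this toy example only "rash" falls strictly below the threshold; "lethargy" sits exactly at 10% and is therefore not flagged.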

Methods: Variable selection using RF was compared to forward and backward SWR. The resulting models were compared on the area under the receiver operating characteristic curve (AUC) and on the variables each retained. Additionally, a simulation study assessed how increasing the prevalence of the predictors affected the variable selection results.
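As an illustration of the AUC metric used to compare the models, here is a small Python sketch under stated assumptions. It computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half. The labels and scores are made up, and this is not the authors' code or their confidence-interval procedure.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative cases.

    Each (positive, negative) pair contributes 1 if the positive case
    has the higher score, 0.5 on a tie, and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = sepsis, 0 = no sepsis) and scores from two models.
labels = [1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
scores_b = [0.8, 0.8, 0.3, 0.5, 0.3, 0.2, 0.1]

auc_a = auc(labels, scores_a)
auc_b = auc(labels, scores_b)
print(round(auc_a, 3), round(auc_b, 3))
```

The pairwise form is quadratic in the number of cases, which is fine for a sketch; a rank-based formulation gives the same value more efficiently on large datasets.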

Results: The best-fitting RF and SWR models retained 22 and 17 predictors, respectively, of which 14 and 10 had a prevalence < 10%. The RF and SWR models had similar predictive performance (AUC [95% Confidence Interval], RF: 0.79 [0.77, 0.81]; SWR: 0.80 [0.78, 0.82]). The simulation study revealed differences for both RF and SWR models in variable importance rankings and predictor selection with increasing prevalence thresholds, particularly for moderately and strongly associated predictors.

Discussion: The RF algorithm retained more very low prevalence predictors than SWR. However, the predictive performance of the two models was comparable, demonstrating that when the methods are applied correctly and the number of candidate predictors is small, both are suitable for variable selection with low prevalence predictors.

Source journal
Maternal and Child Health Journal (PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH)
CiteScore: 3.20
Self-citation rate: 4.30%
Articles published: 271
Journal description: Maternal and Child Health Journal is the first exclusive forum to advance the scientific and professional knowledge base of the maternal and child health (MCH) field. This bimonthly journal provides peer-reviewed papers addressing the following areas of MCH practice, policy, and research: MCH epidemiology, demography, and health status assessment; innovative MCH service initiatives; implementation of MCH programs; MCH policy analysis and advocacy; and MCH professional development. Exploring the full spectrum of the MCH field, Maternal and Child Health Journal is an important tool for practitioners as well as academics in public health, obstetrics, gynecology, prenatal medicine, pediatrics, and neonatology. Sponsors include the Association of Maternal and Child Health Programs (AMCHP), the Association of Teachers of Maternal and Child Health (ATMCH), and CityMatCH.