{"title":"On the Use of K-Fold Cross-Validation to Choose Cutoff Values and Assess the Performance of Predictive Models in Stepwise Regression","authors":"Z. Mahmood, Salahuddin J. Khan","doi":"10.2202/1557-4679.1105","DOIUrl":null,"url":null,"abstract":"This paper addresses a methodological technique of leave-many-out cross-validation for choosing cutoff values in stepwise regression methods for simplifying the final regression model. A practical approach to choose cutoff values through cross-validation is to compute the minimum Predicted Residual Sum of Squares (PRESS). A leave-one-out cross-validation may overestimate the predictive model capabilities, for example see Shao (1993) and So et al (2000). Shao proves with asymptotic results and simulation that the model with the minimum value for the leave-oneout cross validation estimate of predictor errors is often over specified. That is, too many insignificant variables are contained in set βi of the regression model. He recommended using a method that leaves out a subset of observations, called K-fold cross-validation. Leave-many-out procedures can be more adequate in order to obtain significant and optimal results. We describe various investigations for the assessment of performance of predictive regression models, including different values of K in K-fold cross-validation and selecting the best possible cutoffvalues for automated model selection methods. We propose a resampling procedure by introducing alternative estimates of boosted cross-validated PRESS values for deciding the number of observations (l) to be omitted and number of folds/subsets (K) subsequently in K-fold cross-validation. Salahuddin and Hawkes (1991) used leave-one-out cross-validation to select equal cutoff values in stepwise regression which minimizes PRESS. We concentrate on applying K-fold cross-validation to choose unequal cutoff values that is F-to-enter and F-to-remove values which are then used for determining predictor variables in a regression model from the full data set. Our computer program for K-fold cross-validation can be efficiently used for choosing both equal and unequal cutoff values for automated model selection methods. Some previously analyzed data and Monte Carlo simulation are used to evaluate the proposed method against alternatives through a design experiment approach.","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":"5 1","pages":""},"PeriodicalIF":1.2000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2202/1557-4679.1105","citationCount":"31","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Biostatistics","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.2202/1557-4679.1105","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 31
Abstract
This paper addresses a methodological technique of leave-many-out cross-validation for choosing cutoff values in stepwise regression methods for simplifying the final regression model. A practical approach to choose cutoff values through cross-validation is to compute the minimum Predicted Residual Sum of Squares (PRESS). A leave-one-out cross-validation may overestimate the predictive model capabilities, for example see Shao (1993) and So et al (2000). Shao proves with asymptotic results and simulation that the model with the minimum value for the leave-oneout cross validation estimate of predictor errors is often over specified. That is, too many insignificant variables are contained in set βi of the regression model. He recommended using a method that leaves out a subset of observations, called K-fold cross-validation. Leave-many-out procedures can be more adequate in order to obtain significant and optimal results. We describe various investigations for the assessment of performance of predictive regression models, including different values of K in K-fold cross-validation and selecting the best possible cutoffvalues for automated model selection methods. We propose a resampling procedure by introducing alternative estimates of boosted cross-validated PRESS values for deciding the number of observations (l) to be omitted and number of folds/subsets (K) subsequently in K-fold cross-validation. Salahuddin and Hawkes (1991) used leave-one-out cross-validation to select equal cutoff values in stepwise regression which minimizes PRESS. We concentrate on applying K-fold cross-validation to choose unequal cutoff values that is F-to-enter and F-to-remove values which are then used for determining predictor variables in a regression model from the full data set. Our computer program for K-fold cross-validation can be efficiently used for choosing both equal and unequal cutoff values for automated model selection methods. Some previously analyzed data and Monte Carlo simulation are used to evaluate the proposed method against alternatives through a design experiment approach.
本文讨论了在逐步回归方法中选择截止值以简化最终回归模型的留多交叉验证方法技术。通过交叉验证选择截止值的一个实用方法是计算最小预测残差平方和(PRESS)。留一交叉验证可能会高估预测模型的能力,例如参见Shao(1993)和So et al(2000)。Shao用渐近结果和仿真证明了预测器误差的留一交叉验证估计的最小值的模型往往是过度指定的。即回归模型的集合βi中包含了太多的不显著变量。他建议使用一种方法,即K-fold交叉验证,这种方法可以剔除一部分观察结果。为了获得显著和最佳的结果,省略许多程序可能更充分。我们描述了评估预测回归模型性能的各种研究,包括K-fold交叉验证中的不同K值以及为自动模型选择方法选择最佳可能的截止值。我们提出了一种重采样程序,通过引入增强交叉验证PRESS值的替代估计来决定K-fold交叉验证中要忽略的观测数(l)和随后的折叠/子集数(K)。Salahuddin和Hawkes(1991)使用留一交叉验证在逐步回归中选择相等的截止值,从而使PRESS最小化。我们专注于应用K-fold交叉验证来选择不相等的截止值,即F-to-enter和F-to-remove值,然后用于从完整数据集确定回归模型中的预测变量。我们的计算机程序的K-fold交叉验证可以有效地用于选择相等和不相等的截止值自动模型选择方法。利用先前分析的数据和蒙特卡罗模拟,通过设计实验方法对所提出的方法与备选方法进行了评估。
期刊介绍:
The International Journal of Biostatistics (IJB) seeks to publish new biostatistical models and methods, new statistical theory, as well as original applications of statistical methods, for important practical problems arising from the biological, medical, public health, and agricultural sciences with an emphasis on semiparametric methods. Given many alternatives to publish exist within biostatistics, IJB offers a place to publish for research in biostatistics focusing on modern methods, often based on machine-learning and other data-adaptive methodologies, as well as providing a unique reading experience that compels the author to be explicit about the statistical inference problem addressed by the paper. IJB is intended that the journal cover the entire range of biostatistics, from theoretical advances to relevant and sensible translations of a practical problem into a statistical framework. Electronic publication also allows for data and software code to be appended, and opens the door for reproducible research allowing readers to easily replicate analyses described in a paper. Both original research and review articles will be warmly received, as will articles applying sound statistical methods to practical problems.