Constructing a binary prediction model with incomplete data: Variable selection to balance fairness and precision.
Authors: He Ren, Chun Wang, Gongjun Xu, David J Weiss
Journal: Psychological Methods
Publication date: 2025-08-14
DOI: 10.1037/met0000786 (https://doi.org/10.1037/met0000786)
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356495/pdf/
Citations: 0
Abstract
The statistical and pragmatic tension between explanation and prediction is well recognized in psychology. Yarkoni and Westfall (2017) suggested focusing more on prediction, which will ultimately produce better-calibrated interpretations. Variable selection methods, such as regularization, are strongly recommended because they help construct interpretable models while optimizing prediction accuracy. However, when the data contain a nonignorable proportion of missingness, variable selection and model building via penalized regression methods are not straightforward. The analysis protocol is further complicated when model performance is evaluated on both prediction accuracy and fairness; the latter is receiving increasing attention when the predicted outcome has societal implications. This study explored two methods for variable selection with incomplete data: the bootstrap imputation-stability selection (BI-SS) method and the stacked elastic net (SENET) method. Both methods work with multiply imputed data sets, but in different ways. BI-SS performs variable selection separately on each imputed bootstrap data set and aggregates the results via stability selection, whereas SENET stacks all imputed data sets and fits a single pooled model. We thoroughly evaluated their performance using a suite of metrics (including area under the curve, F1 score, and fairness criteria) in three increasingly complex simulation studies. Results reveal that while BI-SS and SENET perform almost equally well in settings with generalized linear models, only BI-SS fares well with nested data designs because of the high computational demand of fitting regularized generalized linear mixed-effects models. Finally, we demonstrated both methods with an example using rich electronic health data. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
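The difference between the two ways of using multiply imputed data can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.6 selection-frequency cutoff, the 1/M row weights, and the toy variable names x1 to x4 are all assumptions made for illustration. The first function shows only the aggregation step of a BI-SS-style procedure (the per-data-set penalized fits are assumed to have already produced sets of selected variables); the second shows only the stacking step that would precede a single pooled elastic net fit.

```python
from collections import Counter


def stability_select(selected_sets, threshold=0.6):
    """Aggregate per-data-set selections into selection frequencies.

    selected_sets: one collection of selected variable names per imputed
        bootstrap data set (output of some penalized fit, not shown here).
    threshold: hypothetical cutoff on the selection frequency.
    """
    if not selected_sets:
        raise ValueError("need at least one selection result")
    n = len(selected_sets)
    # Count each variable at most once per data set.
    counts = Counter(name for s in selected_sets for name in set(s))
    freq = {name: c / n for name, c in counts.items()}
    stable = sorted(name for name, f in freq.items() if f >= threshold)
    return freq, stable


def stack_imputations(imputed_sets):
    """Row-wise stack M imputed data sets for a single pooled fit.

    Each row gets weight 1/M (an assumption of this sketch) so that the
    stacked data carry the weight of one original data set.
    """
    m = len(imputed_sets)
    rows = [row for data in imputed_sets for row in data]
    weights = [1.0 / m] * len(rows)
    return rows, weights


# Toy, made-up selections from five imputed bootstrap data sets.
selections = [
    {"x1", "x2", "x3"},
    {"x1", "x3"},
    {"x1", "x2", "x4"},
    {"x1", "x3"},
    {"x1", "x3", "x4"},
]
freq, stable = stability_select(selections, threshold=0.6)

# Toy stacking of two imputed copies of a two-row data set.
rows, weights = stack_imputations([[(1, 0), (2, 1)], [(1, 0), (3, 1)]])
```

With these toy inputs, x1 and x3 are retained (selection frequencies 1.0 and 0.8), while x2 and x4 (0.4 each) fall below the assumed cutoff; the stacked data have four rows, each weighted 0.5.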
About the journal:
Psychological Methods is devoted to the development and dissemination of methods for collecting, analyzing, understanding, and interpreting psychological data. Its purpose is the dissemination of innovations in research design, measurement, methodology, and quantitative and qualitative analysis to the psychological community; its further purpose is to promote effective communication about related substantive and methodological issues. The audience is expected to be diverse and to include those who develop new procedures, those who are responsible for undergraduate and graduate training in design, measurement, and statistics, as well as those who employ those procedures in research.