Constructing a binary prediction model with incomplete data: Variable selection to balance fairness and precision.

IF 7.8 · Zone 1 (Psychology) · Q1 PSYCHOLOGY, MULTIDISCIPLINARY

Psychological methods Pub Date : 2025-08-14 DOI:10.1037/met0000786

He Ren, Chun Wang, Gongjun Xu, David J Weiss


Citations: 0

Abstract

The statistical and pragmatic tension between explanation and prediction is well recognized in psychology. Yarkoni and Westfall (2017) suggested focusing more on prediction, which will ultimately produce better-calibrated interpretations. Variable selection methods, such as regularization, are strongly recommended because they help construct interpretable models while optimizing prediction accuracy. However, when the data contain a nonignorable proportion of missingness, variable selection and model building via penalized regression methods are not straightforward. The analysis protocol is further complicated when model performance is evaluated on both prediction accuracy and fairness; the latter is receiving increasing attention when the predicted outcome has societal implications. This study explored two methods for variable selection with incomplete data: the bootstrap imputation-stability selection (BI-SS) method and the stacked elastic net (SENET) method. Both methods work with multiply imputed data sets but in different ways. BI-SS implements variable selection separately on each imputed bootstrap data set and aggregates the results via stability selection, while SENET stacks all imputed data sets and fits a single pooled model. We thoroughly evaluated their performance using a suite of metrics (including area under the curve, F1 score, and fairness criteria) via three increasingly complex simulation studies. Results reveal that while the BI-SS and SENET methods perform almost equally well in settings with generalized linear models, only BI-SS fares well with nested data designs because of the high computational demand of fitting the regularized generalized linear mixed-effects models. Finally, we demonstrated both methods with an example using rich electronic health data. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
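
The difference between the two pooling strategies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.6 selection threshold, the 1/M row weights, and the toy variable names are all assumptions made for illustration, and the penalized model fit itself is left out (the selected sets are simply given as hypothetical inputs).

```python
from collections import Counter


def selection_frequencies(selected_sets):
    """Fraction of fits in which each variable was selected."""
    counts = Counter(v for s in selected_sets for v in set(s))
    total = len(selected_sets)
    return {v: c / total for v, c in counts.items()}


def stability_select(selected_sets, threshold=0.6):
    """BI-SS-style aggregation (assumed form): keep variables that were
    selected in at least `threshold` of the per-data-set fits."""
    freqs = selection_frequencies(selected_sets)
    return sorted(v for v, f in freqs.items() if f >= threshold)


def stack_imputations(imputed_sets):
    """SENET-style pooling (assumed form): concatenate the M imputed data
    sets into one long data set and give every row weight 1/M, so the
    stacked data carry the weight of one data set rather than M."""
    m = len(imputed_sets)
    rows = [row for data in imputed_sets for row in data]
    weights = [1.0 / m] * len(rows)
    return rows, weights


# Hypothetical variables selected by a penalized fit on each of five
# imputed bootstrap data sets.
selected_sets = [
    {"age", "bmi", "smoker"},
    {"age", "smoker"},
    {"age", "bmi"},
    {"age", "smoker", "income"},
    {"age", "smoker"},
]
stable = stability_select(selected_sets, threshold=0.6)
print(stable)

# Three hypothetical imputations of the same two cases: (predictor, outcome).
imputed_sets = [
    [(1.0, 0), (2.0, 1)],
    [(1.5, 0), (2.0, 1)],
    [(0.5, 0), (2.0, 1)],
]
rows, weights = stack_imputations(imputed_sets)
print(len(rows), sum(weights))
```

In the first strategy the aggregation happens after model fitting, over the selected variable sets; in the second it happens before fitting, over the data, so a single weighted penalized model would then be fit to `rows` and `weights`.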


Source journal

Psychological Methods (PSYCHOLOGY, MULTIDISCIPLINARY)

CiteScore: 13.10

Self-citation rate: 7.10%

Publication count: 159

About the journal: Psychological Methods is devoted to the development and dissemination of methods for collecting, analyzing, understanding, and interpreting psychological data. Its purpose is the dissemination of innovations in research design, measurement, methodology, and quantitative and qualitative analysis to the psychological community; its further purpose is to promote effective communication about related substantive and methodological issues. The audience is expected to be diverse and to include those who develop new procedures, those who are responsible for undergraduate and graduate training in design, measurement, and statistics, as well as those who employ those procedures in research.