Bayesian subset selection and variable importance for interpretable prediction and classification.

IF 5.2 3区计算机科学 Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research Pub Date : 2022-01-01

Daniel R Kowal

{"title":"Bayesian subset selection and variable importance for interpretable prediction and classification.","authors":"Daniel R Kowal","doi":"","DOIUrl":null,"url":null,"abstract":"Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model <math><mi>ℳ</mi></math>, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single \"best\" subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via <math><mi>ℳ</mi></math>. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.","PeriodicalId":50161,"journal":{"name":"Journal of Machine Learning Research","volume":"23 ","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10723825/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Machine Learning Research","FirstCategoryId":"94","ListUrlMain":"","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model $ℳ$ , we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single "best" subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via $ℳ$ . For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.

本刊更多论文

用于可解释预测和分类的贝叶斯子集选择和变量重要性。

子集选择是可解释学习、科学发现和数据压缩的重要工具。然而，由于选择的不稳定性、缺乏正则化以及选择后推理的困难，经典的子集选择常常被回避。我们从贝叶斯的角度来解决这些难题。给定任何贝叶斯预测模型ℳ，我们就能为线性预测或分类提取一系列近乎最优的变量子集。这一策略不再强调单一 "最佳 "子集的作用，而是从更广阔的视角出发，认为许多子集往往具有很强的竞争力。可接受子集系列为模型解释提供了一条新途径，其主要成员（如最小可接受子集）以及新的（共同）变量重要性度量（基于变量（共同）是否出现在所有、部分或无可接受子集中）均可清晰概括。更广义地说，我们应用贝叶斯决策分析为任何变量子集推导出最优线性系数。这些系数通过ℳ继承了正则化和预测不确定性量化。对于模拟数据和真实数据，所提出的方法在预测、区间估计和变量选择方面都优于其他贝叶斯和频数选择方法。这些工具被应用于具有高度相关协变量的大型教育数据集。我们的分析为预测教育结果的环境、社会经济和人口因素组合提供了独特的见解，并确定了 200 多个不同的变量子集，这些变量子集提供了接近最优的样本外预测准确性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Machine Learning Research 工程技术-计算机：人工智能

CiteScore

18.80

自引率

0.00%

发文量

审稿时长

3 months

期刊介绍： The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. JMLR seeks previously unpublished papers on machine learning that contain: new principled algorithms with sound empirical validation, and with justification of theoretical, psychological, or biological nature; experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems; accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods; formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks; development of new analytical frameworks that advance theoretical studies of practical learning methods; computational models of data from natural learning systems at the behavioral or neural level; or extremely well-written surveys of existing work.