Statistical Inference for Variable Importance

IF 1.2, Zone 4 (Mathematics)

International Journal of Biostatistics Pub Date : 1900-01-01 DOI:10.2202/1557-4679.1008

M. J. van der Laan

Citations: 163

Abstract

Many statistical problems involve learning the importance/effect of a variable for predicting an outcome of interest, based on a sample of $n$ independent and identically distributed observations on a list of input variables and an outcome. For example, though prediction/machine learning is, in principle, concerned with learning the optimal unknown mapping from input variables to an outcome from the data, the typical reported output is a list of importance measures for each input variable. The approach in prediction has been to learn the unknown optimal predictor from the data and derive, for each of the input variables, the variable importance from the obtained fit. In this article we propose a new approach which involves, for each variable separately, 1) defining variable importance as a real-valued parameter, 2) deriving the efficient influence curve, and thereby the optimal estimating function, for this parameter in the assumed (possibly nonparametric) model, and 3) developing a corresponding double robust locally efficient estimator of this variable importance, obtained by substituting data-adaptive estimators for the nuisance parameters in the optimal estimating function. We illustrate this methodology in the context of prediction, and obtain in this manner double robust locally optimal estimators of marginal variable importance, accompanied by p-values and confidence intervals. In addition, we present a model-based and machine learning approach to estimate covariate-adjusted variable importance. Finally, we generalize this methodology to variable importance parameters for time-dependent variables.
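The three steps in the abstract can be illustrated with a minimal sketch. It is not the article's estimator: it assumes a binary variable $A$, covariates $W$, and the marginal importance parameter $\psi = E[E(Y \mid A=1, W) - E(Y \mid A=0, W)]$, and it uses the standard augmented inverse-probability-weighted estimating function for that parameter. Simple parametric fits stand in for the data-adaptive nuisance estimators the abstract mentions, and all function names, the simulated data, and the truncation bounds on the fitted probabilities are hypothetical choices made for illustration.

```python
import math
import numpy as np


def fit_logistic(X, a, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (a - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def dr_variable_importance(y, a, w):
    """Double robust estimate of psi = E[Q(1, W) - Q(0, W)] for binary a.

    Returns (estimate, standard error, 95% confidence interval, two-sided p-value).
    """
    n = len(y)
    ones = np.ones((n, 1))

    # Nuisance 1: outcome regression Q(a, W) = E[Y | A = a, W] (linear model here).
    Xq = np.hstack([ones, a[:, None], w])
    bq, *_ = np.linalg.lstsq(Xq, y, rcond=None)
    q1 = np.hstack([ones, np.ones((n, 1)), w]) @ bq
    q0 = np.hstack([ones, np.zeros((n, 1)), w]) @ bq

    # Nuisance 2: g(W) = P(A = 1 | W) (logistic model here), truncated for stability.
    Xg = np.hstack([ones, w])
    g = 1.0 / (1.0 + np.exp(-Xg @ fit_logistic(Xg, a)))
    g = np.clip(g, 0.01, 0.99)

    # Estimating function evaluated at the fitted nuisances; its mean is the estimate.
    d = a / g * (y - q1) - (1 - a) / (1 - g) * (y - q0) + (q1 - q0)
    psi = d.mean()

    # Influence-curve-based standard error, Wald interval, and two-sided p-value.
    se = d.std(ddof=1) / math.sqrt(n)
    ci = (psi - 1.96 * se, psi + 1.96 * se)
    p_value = math.erfc(abs(psi / se) / math.sqrt(2.0))
    return psi, se, ci, p_value


# Simulated data (hypothetical): the true value of psi is 1.0 by construction.
rng = np.random.default_rng(0)
n = 2000
w = rng.normal(size=(n, 2))
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * w[:, 0])))
y = 1.0 * a + w[:, 0] - 0.5 * w[:, 1] + rng.normal(size=n)

psi, se, ci, p_value = dr_variable_importance(y, a, w)
print(f"psi = {psi:.3f}, se = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.3g}")
```

In this sketch, "double robust" means the estimate remains consistent if either the outcome regression or the model for $P(A=1 \mid W)$ is correctly specified. Repeating the call once per input variable would give one estimate, confidence interval, and p-value per variable.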

Source journal

International Journal of Biostatistics Mathematics-Statistics and Probability

CiteScore

2.30

Self-citation rate

8.30%

About the journal: The International Journal of Biostatistics (IJB) seeks to publish new biostatistical models and methods, new statistical theory, and original applications of statistical methods for important practical problems arising from the biological, medical, public health, and agricultural sciences, with an emphasis on semiparametric methods. Given that many alternative venues exist within biostatistics, IJB offers a place to publish research focusing on modern methods, often based on machine learning and other data-adaptive methodologies, and provides a unique reading experience that compels the author to be explicit about the statistical inference problem addressed by the paper. The journal is intended to cover the entire range of biostatistics, from theoretical advances to relevant and sensible translations of a practical problem into a statistical framework. Electronic publication also allows data and software code to be appended, and opens the door to reproducible research by allowing readers to easily replicate the analyses described in a paper. Both original research and review articles will be warmly received, as will articles applying sound statistical methods to practical problems.