Title: A Notion of Feature Importance by Decorrelation and Detection of Trends by Random Forest Regression
Authors: Yannick Gerstorfer, Max Hahn-Klimroth, Lena Krieg
Journal: Data Science Journal
Publication date: 2023-01-01
DOI: 10.5334/dsj-2023-042 (https://doi.org/10.5334/dsj-2023-042)
Citations: 1
Abstract
In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence (i.e., is the feature relevant?) and, if it is relevant, in how the feature influences the dependent variable. Recently, data-driven approaches such as random forest regression have found their way into applications (Boulesteix et al. 2012). These models allow researchers to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method for determining these trends (Lundberg et al. 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative traversal rates. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.
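The abstract names Gram-Schmidt decorrelation but does not spell out the procedure, so the following is only a generic sketch of that technique, not the authors' importance measure or their traversal-rate estimators. It mean-centers each feature column and removes its projection onto the previously processed columns, which leaves columns that are pairwise uncorrelated in the sample. The function names (`center`, `dot`, `pearson`, `gram_schmidt_decorrelate`) and the toy data are assumptions made for illustration.

```python
from math import sqrt


def center(col):
    """Subtract the sample mean from every entry of a column."""
    mean = sum(col) / len(col)
    return [x - mean for x in col]


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def pearson(u, v):
    """Sample Pearson correlation of two columns."""
    cu, cv = center(u), center(v)
    return dot(cu, cv) / sqrt(dot(cu, cu) * dot(cv, cv))


def gram_schmidt_decorrelate(columns, tol=1e-12):
    """Gram-Schmidt on mean-centered columns (hypothetical helper).

    Each column is replaced by its residual after projecting out all
    previously processed columns, so the results are pairwise
    uncorrelated in-sample. The order of the columns matters.
    """
    result = []
    for col in columns:
        v = center(col)
        for u in result:
            norm_sq = dot(u, u)
            if norm_sq > tol:  # skip (near-)zero directions
                coef = dot(v, u) / norm_sq
                v = [a - coef * b for a, b in zip(v, u)]
        result.append(v)
    return result


# Toy data (assumed): x2 is almost a copy of x1, so they are highly correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.3]

z1, z2 = gram_schmidt_decorrelate([x1, x2])
print("correlation before:", pearson(x1, x2))
print("correlation after: ", pearson(z1, z2))
```

In a sketch like this, the decorrelated columns could then be passed to a random forest regressor in place of the raw features, so that an importance score assigned to a column is not shared with a correlated neighbour. Whether this matches the construction in the paper is not stated in the abstract.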
About the journal:
The Data Science Journal is a peer-reviewed electronic journal publishing papers on the management of data and databases in Science and Technology. Details can be found in the prospectus. The scope of the journal includes descriptions of data systems, their publication on the internet, applications and legal issues. All of the Sciences are covered, including the Physical Sciences, Engineering, the Geosciences and the Biosciences, along with Agriculture and the Medical Sciences. The journal publishes papers about data and data systems; it does not publish data or data compilations. However, it may publish papers about methods of data compilation or analysis.