特征重要性的去相关概念和趋势的随机森林回归检测

Q2 Computer Science
Yannick Gerstorfer, Max Hahn-Klimroth, Lena Krieg
{"title":"特征重要性的去相关概念和趋势的随机森林回归检测","authors":"Yannick Gerstorfer, Max Hahn-Klimroth, Lena Krieg","doi":"10.5334/dsj-2023-042","DOIUrl":null,"url":null,"abstract":"In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence – i.e., is the feature relevant? And, if so, how the feature influences the dependent variable. Recently, data-driven approaches such as random forest regression have found their way into applications (Boulesteix et al. 2012). These models allow researchers to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method to determine these trends (Lundberg et al. 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative traversal rate. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Notion of Feature Importance by Decorrelation and Detection of Trends by Random Forest Regression\",\"authors\":\"Yannick Gerstorfer, Max Hahn-Klimroth, Lena Krieg\",\"doi\":\"10.5334/dsj-2023-042\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence – i.e., is the feature relevant? And, if so, how the feature influences the dependent variable. Recently, data-driven approaches such as random forest regression have found their way into applications (Boulesteix et al. 2012). These models allow researchers to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method to determine these trends (Lundberg et al. 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative traversal rate. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.\",\"PeriodicalId\":35375,\"journal\":{\"name\":\"Data Science Journal\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data Science Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5334/dsj-2023-042\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Science Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5334/dsj-2023-042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 1

摘要

在许多研究中,我们想要确定某些特征对因变量的影响。更具体地说,我们感兴趣的是影响力的强弱。也就是说,功能是否相关?如果有,特征是如何影响因变量的。最近,随机森林回归等数据驱动方法已经进入应用领域(Boulesteix et al. 2012)。这些模型使研究人员能够直接得出特征重要性的度量,这是影响强度的自然指标。对于相关特征,通常使用特征与因变量之间的相关性或等级相关性来确定影响的性质。最近的一些方法基于建模方法,其中一些方法也可以测量特征之间的相互作用。特别是,当使用机器学习模型时,SHAP分数是确定这些趋势的最新和突出的方法(Lundberg et al. 2017)。本文在已有研究的Gram-Schmidt去相关方法的基础上,引入了一种新的特征重要性概念。此外,我们提出了使用随机森林回归识别数据趋势的两个估计量,即所谓的绝对和相对遍历率。我们在经验上比较了我们的估计器与在各种合成和现实世界数据集上建立的估计器的性质。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
A Notion of Feature Importance by Decorrelation and Detection of Trends by Random Forest Regression
In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence – i.e., is the feature relevant? And, if so, how the feature influences the dependent variable. Recently, data-driven approaches such as random forest regression have found their way into applications (Boulesteix et al. 2012). These models allow researchers to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method to determine these trends (Lundberg et al. 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative traversal rate. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Data Science Journal
Data Science Journal Computer Science-Computer Science (miscellaneous)
CiteScore
5.40
自引率
0.00%
发文量
17
审稿时长
10 weeks
期刊介绍: The Data Science Journal is a peer-reviewed electronic journal publishing papers on the management of data and databases in Science and Technology. Details can be found in the prospectus. The scope of the journal includes descriptions of data systems, their publication on the internet, applications and legal issues. All of the Sciences are covered, including the Physical Sciences, Engineering, the Geosciences and the Biosciences, along with Agriculture and the Medical Science. The journal publishes papers about data and data systems; it does not publish data or data compilations. However it may publish papers about methods of data compilation or analysis.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信