De-Biased Random Forest Variable Selection

ERN: Other Econometrics: Data Collection & Data Estimation Methodology (Topic). Pub Date: 2011-12-22. DOI: 10.2139/ssrn.1975801

Dhruv Sharma


Citations: 1

Abstract


This paper proposes a new way to de-bias random forest variable selection using a clean random forest algorithm. Strobl et al. (2007) showed that random forests are biased toward variables with many levels, categories, or scales, and toward correlated variables, which can inflate some variable importance measures. The proposed algorithm builds a random forest without each variable in turn and keeps a variable only when dropping it degrades overall random forest performance. The algorithm is simple and straightforward, and its complexity and speed are a function of the number of salient variables. It runs more efficiently than the permutation test algorithm and offers an alternative way to address the known biases. The paper concludes with some normative guidance on how to use random forest variable importance.
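The drop-one-variable idea described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the performance measure, the decision threshold, or how salient variables are pre-selected, so the names `drop_one_selection`, `make_rf_oob_scorer`, and `tolerance` are hypothetical, and out-of-bag accuracy from scikit-learn's `RandomForestClassifier` is assumed as the performance estimate.

```python
from typing import Callable, List, Sequence


def drop_one_selection(
    features: Sequence[str],
    score: Callable[[Sequence[str]], float],
    tolerance: float = 0.0,
) -> List[str]:
    """Keep each feature whose removal lowers the score by more than `tolerance`.

    `score(columns)` is assumed to fit a random forest on `columns` and return
    a performance estimate where higher is better (e.g. out-of-bag accuracy).
    Assumes at least two features, so every reduced set is non-empty.
    """
    baseline = score(list(features))  # forest built on all candidate variables
    kept = []
    for name in features:
        # Rebuild the forest without this one variable.
        reduced = [f for f in features if f != name]
        # Keep the variable only if performance degrades without it.
        if baseline - score(reduced) > tolerance:
            kept.append(name)
    return kept


def make_rf_oob_scorer(X, y, n_estimators: int = 500, seed: int = 0):
    """Build a scorer from a pandas DataFrame X and labels y (assumed setup)."""
    # Imported lazily so the selection loop above has no hard dependency.
    from sklearn.ensemble import RandomForestClassifier

    def score(columns):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, oob_score=True, random_state=seed
        )
        forest.fit(X[list(columns)], y)
        return forest.oob_score_  # out-of-bag accuracy on the training data

    return score
```

With a DataFrame `X` and labels `y`, a call such as `drop_one_selection(list(X.columns), make_rf_oob_scorer(X, y))` would return the variables whose removal hurt out-of-bag accuracy. Because forest performance varies between runs, a small positive `tolerance` (an assumption, not from the paper) may help avoid keeping variables on noise alone. The loop fits one forest per candidate variable plus one baseline, which is consistent with the abstract's remark that cost depends on the number of variables considered.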
