Model-independent variable selection via the rule-based variable priority

Min Lu, Hemant Ishwaran
{"title":"Model-independent variable selection via the rule-based variable priorit","authors":"Min Lu, Hemant Ishwaran","doi":"arxiv-2409.09003","DOIUrl":null,"url":null,"abstract":"While achieving high prediction accuracy is a fundamental goal in machine\nlearning, an equally important task is finding a small number of features with\nhigh explanatory power. One popular selection technique is permutation\nimportance, which assesses a variable's impact by measuring the change in\nprediction error after permuting the variable. However, this can be problematic\ndue to the need to create artificial data, a problem shared by other methods as\nwell. Another problem is that variable selection methods can be limited by\nbeing model-specific. We introduce a new model-independent approach, Variable\nPriority (VarPro), which works by utilizing rules without the need to generate\nartificial data or evaluate prediction error. The method is relatively easy to\nuse, requiring only the calculation of sample averages of simple statistics,\nand can be applied to many data settings, including regression, classification,\nand survival. We investigate the asymptotic properties of VarPro and show,\namong other things, that VarPro has a consistent filtering property for noise\nvariables. Empirical studies using synthetic and real-world data show the\nmethod achieves a balanced performance and compares favorably to many\nstate-of-the-art procedures currently used for variable selection.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

While achieving high prediction accuracy is a fundamental goal in machine learning, an equally important task is finding a small number of features with high explanatory power. One popular selection technique is permutation importance, which assesses a variable's impact by measuring the change in prediction error after permuting the variable. However, this can be problematic due to the need to create artificial data, a problem shared by other methods as well. Another problem is that variable selection methods can be limited by being model-specific. We introduce a new model-independent approach, Variable Priority (VarPro), which works by utilizing rules without the need to generate artificial data or evaluate prediction error. The method is relatively easy to use, requiring only the calculation of sample averages of simple statistics, and can be applied to many data settings, including regression, classification, and survival. We investigate the asymptotic properties of VarPro and show, among other things, that VarPro has a consistent filtering property for noise variables. Empirical studies using synthetic and real-world data show the method achieves a balanced performance and compares favorably to many state-of-the-art procedures currently used for variable selection.
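To make the permutation-importance baseline discussed in the abstract concrete, the sketch below illustrates the standard idea: a variable's importance is the increase in prediction error after that variable's values are permuted. This is a minimal illustrative example using scikit-learn and synthetic data (neither is part of the paper), and it does not implement VarPro itself, whose rule-based construction is described in the full text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit any predictive model; a random forest is used here purely as an example.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
baseline = mean_squared_error(y_test, model.predict(X_test))

# Permutation importance: permute one column at a time (creating the
# "artificial data" the abstract refers to) and record the increase in
# test error relative to the unpermuted baseline.
importances = []
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(mean_squared_error(y_test, model.predict(X_perm)) - baseline)

for j, imp in enumerate(importances):
    print(f"feature {j}: importance {imp:.3f}")
```

In this toy setting the first two features should receive clearly positive scores and the noise features scores near zero; VarPro targets the same ranking goal but, per the abstract, avoids generating permuted data and evaluating prediction error altogether.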