Resampling-Based Variable Selection with Lasso for p >> n and Partially Linear Models

Mihaela A. Mares, Yike Guo
{"title":"Resampling-Based Variable Selection with Lasso for p >> n and Partially Linear Models","authors":"Mihaela A. Mares, Yike Guo","doi":"10.1109/ICMLA.2015.134","DOIUrl":null,"url":null,"abstract":"The linear model of the regression function is a widely used and perhaps, in most cases, highly unrealistic simplifying assumption, when proposing consistent variable selection methods for large and highly-dimensional datasets. In this paper, we study what happens from theoretical point of view, when a variable selection method assumes a linear regression function and the underlying ground-truth model is composed of a linear and a non-linear term, that is at most partially linear. We demonstrate consistency of the Lasso method when the model is partially linear. However, we note that the algorithm tends to increase even more the number of selected false positives on partially linear models when given few training samples. That is usually because the values of small groups of samples happen to explain variation coming from the non-linear part of the response function and the noise, using a linear combination of wrong predictors. We demonstrate theoretically that false positives are likely to be selected by the Lasso method due to a small proportion of samples, which happen to explain some variation in the response variable. We show that this property implies that if we run the Lasso on several slightly smaller size data replications, sampled without replacement, and intersect the results, we are likely to reduce the number of false positives without losing already selected true positives. We propose a novel consistent variable selection algorithm based on this property and we show it can outperform other variable selection methods on synthetic datasets of linear and partially linear models and datasets from the UCI machine learning repository.","PeriodicalId":288427,"journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The linear model of the regression function is a widely used and, in most cases, highly unrealistic simplifying assumption when proposing consistent variable selection methods for large, high-dimensional datasets. In this paper, we study, from a theoretical point of view, what happens when a variable selection method assumes a linear regression function but the underlying ground-truth model is composed of a linear and a non-linear term, i.e., it is at most partially linear. We demonstrate the consistency of the Lasso method when the model is partially linear. However, we note that the algorithm tends to select even more false positives on partially linear models when given few training samples. This is usually because the values of small groups of samples happen to explain variation coming from the non-linear part of the response function and from the noise through a linear combination of wrong predictors. We demonstrate theoretically that false positives are likely to be selected by the Lasso because of a small proportion of samples that happen to explain some variation in the response variable. We show that this property implies that if we run the Lasso on several slightly smaller data replications, sampled without replacement, and intersect the results, we are likely to reduce the number of false positives without losing the already selected true positives. We propose a novel consistent variable selection algorithm based on this property, and we show that it can outperform other variable selection methods on synthetic datasets from linear and partially linear models and on datasets from the UCI machine learning repository.
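
The resample-and-intersect procedure described in the abstract lends itself to a compact illustration. Below is a minimal sketch, assuming scikit-learn's Lasso; the number of replications, the subsample fraction, the regularization strength alpha, and the function name are illustrative assumptions, not the authors' exact algorithm or settings.

    # Sketch of the idea in the abstract: run the Lasso on several slightly
    # smaller subsamples drawn without replacement and keep only the variables
    # selected in every run. Hyperparameters here are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import Lasso

    def intersected_lasso_selection(X, y, n_replications=10,
                                    subsample_frac=0.9, alpha=0.1, seed=0):
        """Return indices of variables selected by the Lasso on every subsample."""
        rng = np.random.default_rng(seed)
        n_samples = X.shape[0]
        subsample_size = int(subsample_frac * n_samples)
        selected = None
        for _ in range(n_replications):
            # Draw a slightly smaller data replication without replacement.
            idx = rng.choice(n_samples, size=subsample_size, replace=False)
            model = Lasso(alpha=alpha).fit(X[idx], y[idx])
            support = set(np.flatnonzero(model.coef_))
            # Intersect with the supports found on previous replications.
            selected = support if selected is None else selected & support
        return sorted(selected)

The intuition is that a false positive owes its selection to a small group of samples, so it is unlikely to survive every subsample, whereas a true positive is supported by most of the data and remains in the intersection.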