Resampling-Based Variable Selection with Lasso for p >> n and Partially Linear Models

Mihaela A. Mares, Yike Guo
{"title":"Resampling-Based Variable Selection with Lasso for p >> n and Partially Linear Models","authors":"Mihaela A. Mares, Yike Guo","doi":"10.1109/ICMLA.2015.134","DOIUrl":null,"url":null,"abstract":"The linear model of the regression function is a widely used and perhaps, in most cases, highly unrealistic simplifying assumption, when proposing consistent variable selection methods for large and highly-dimensional datasets. In this paper, we study what happens from theoretical point of view, when a variable selection method assumes a linear regression function and the underlying ground-truth model is composed of a linear and a non-linear term, that is at most partially linear. We demonstrate consistency of the Lasso method when the model is partially linear. However, we note that the algorithm tends to increase even more the number of selected false positives on partially linear models when given few training samples. That is usually because the values of small groups of samples happen to explain variation coming from the non-linear part of the response function and the noise, using a linear combination of wrong predictors. We demonstrate theoretically that false positives are likely to be selected by the Lasso method due to a small proportion of samples, which happen to explain some variation in the response variable. We show that this property implies that if we run the Lasso on several slightly smaller size data replications, sampled without replacement, and intersect the results, we are likely to reduce the number of false positives without losing already selected true positives. We propose a novel consistent variable selection algorithm based on this property and we show it can outperform other variable selection methods on synthetic datasets of linear and partially linear models and datasets from the UCI machine learning repository.","PeriodicalId":288427,"journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The linear model of the regression function is a widely used and, in most cases, highly unrealistic simplifying assumption when proposing consistent variable selection methods for large, high-dimensional datasets. In this paper, we study, from a theoretical point of view, what happens when a variable selection method assumes a linear regression function but the underlying ground-truth model is composed of a linear and a non-linear term, i.e., it is at most partially linear. We demonstrate the consistency of the Lasso method when the model is partially linear. However, we note that the algorithm tends to select even more false positives on partially linear models when given few training samples. This is usually because the values of small groups of samples happen to explain variation coming from the non-linear part of the response function and from the noise through a linear combination of wrong predictors. We demonstrate theoretically that false positives are likely to be selected by the Lasso because of a small proportion of samples that happen to explain some variation in the response variable. We show that this property implies that if we run the Lasso on several slightly smaller data replications, sampled without replacement, and intersect the results, we are likely to reduce the number of false positives without losing the already selected true positives. We propose a novel consistent variable selection algorithm based on this property, and we show that it can outperform other variable selection methods on synthetic datasets from linear and partially linear models and on datasets from the UCI machine learning repository.
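
The resample-and-intersect procedure described in the abstract lends itself to a compact illustration. Below is a minimal sketch, assuming scikit-learn's Lasso; the number of replications, the subsample fraction, the regularization strength alpha, and the function name are illustrative assumptions, not the authors' exact algorithm or settings.

    # Sketch of the idea in the abstract: run the Lasso on several slightly
    # smaller subsamples drawn without replacement and keep only the variables
    # selected in every run. Hyperparameters here are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import Lasso

    def intersected_lasso_selection(X, y, n_replications=10,
                                    subsample_frac=0.9, alpha=0.1, seed=0):
        """Return indices of variables selected by the Lasso on every subsample."""
        rng = np.random.default_rng(seed)
        n_samples = X.shape[0]
        subsample_size = int(subsample_frac * n_samples)
        selected = None
        for _ in range(n_replications):
            # Draw a slightly smaller data replication without replacement.
            idx = rng.choice(n_samples, size=subsample_size, replace=False)
            model = Lasso(alpha=alpha).fit(X[idx], y[idx])
            support = set(np.flatnonzero(model.coef_))
            # Intersect with the supports found on previous replications.
            selected = support if selected is None else selected & support
        return sorted(selected)

The intuition is that a false positive owes its selection to a small group of samples, so it is unlikely to survive every subsample, whereas a true positive is supported by most of the data and remains in the intersection.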