{"title":"p >> n和部分线性模型的Lasso重采样变量选择","authors":"Mihaela A. Mares, Yike Guo","doi":"10.1109/ICMLA.2015.134","DOIUrl":null,"url":null,"abstract":"The linear model of the regression function is a widely used and perhaps, in most cases, highly unrealistic simplifying assumption, when proposing consistent variable selection methods for large and highly-dimensional datasets. In this paper, we study what happens from theoretical point of view, when a variable selection method assumes a linear regression function and the underlying ground-truth model is composed of a linear and a non-linear term, that is at most partially linear. We demonstrate consistency of the Lasso method when the model is partially linear. However, we note that the algorithm tends to increase even more the number of selected false positives on partially linear models when given few training samples. That is usually because the values of small groups of samples happen to explain variation coming from the non-linear part of the response function and the noise, using a linear combination of wrong predictors. We demonstrate theoretically that false positives are likely to be selected by the Lasso method due to a small proportion of samples, which happen to explain some variation in the response variable. We show that this property implies that if we run the Lasso on several slightly smaller size data replications, sampled without replacement, and intersect the results, we are likely to reduce the number of false positives without losing already selected true positives. We propose a novel consistent variable selection algorithm based on this property and we show it can outperform other variable selection methods on synthetic datasets of linear and partially linear models and datasets from the UCI machine learning repository.","PeriodicalId":288427,"journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Resampling-Based Variable Selection with Lasso for p >> n and Partially Linear Models\",\"authors\":\"Mihaela A. Mares, Yike Guo\",\"doi\":\"10.1109/ICMLA.2015.134\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The linear model of the regression function is a widely used and perhaps, in most cases, highly unrealistic simplifying assumption, when proposing consistent variable selection methods for large and highly-dimensional datasets. In this paper, we study what happens from theoretical point of view, when a variable selection method assumes a linear regression function and the underlying ground-truth model is composed of a linear and a non-linear term, that is at most partially linear. We demonstrate consistency of the Lasso method when the model is partially linear. However, we note that the algorithm tends to increase even more the number of selected false positives on partially linear models when given few training samples. That is usually because the values of small groups of samples happen to explain variation coming from the non-linear part of the response function and the noise, using a linear combination of wrong predictors. We demonstrate theoretically that false positives are likely to be selected by the Lasso method due to a small proportion of samples, which happen to explain some variation in the response variable. 
We show that this property implies that if we run the Lasso on several slightly smaller size data replications, sampled without replacement, and intersect the results, we are likely to reduce the number of false positives without losing already selected true positives. We propose a novel consistent variable selection algorithm based on this property and we show it can outperform other variable selection methods on synthetic datasets of linear and partially linear models and datasets from the UCI machine learning repository.\",\"PeriodicalId\":288427,\"journal\":{\"name\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA.2015.134\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Resampling-Based Variable Selection with Lasso for p >> n and Partially Linear Models
The linear model of the regression function is a widely used, and in most cases highly unrealistic, simplifying assumption when proposing consistent variable selection methods for large, high-dimensional datasets. In this paper, we study, from a theoretical point of view, what happens when a variable selection method assumes a linear regression function but the underlying ground-truth model is composed of a linear and a non-linear term, i.e., is at most partially linear. We demonstrate consistency of the Lasso method when the model is partially linear. However, we note that the algorithm tends to select even more false positives on partially linear models when given few training samples. This is usually because the values of small groups of samples happen to explain variation coming from the non-linear part of the response function and from the noise through a linear combination of the wrong predictors. We demonstrate theoretically that false positives are likely to be selected by the Lasso due to small proportions of samples that happen to explain some variation in the response variable. We show that this property implies that if we run the Lasso on several slightly smaller data replications, sampled without replacement, and intersect the results, we are likely to reduce the number of false positives without losing already selected true positives. Based on this property, we propose a novel consistent variable selection algorithm and show that it can outperform other variable selection methods on synthetic datasets generated from linear and partially linear models, as well as on datasets from the UCI machine learning repository.
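As a minimal sketch of the resampling-and-intersection idea described above, assuming scikit-learn's LassoCV as the base Lasso estimator and illustrative values for the number of replications and the subsample fraction (neither is specified in the abstract), one could write:

    # Illustrative sketch, not the paper's exact algorithm: run Lasso on
    # several subsamples drawn without replacement and keep only the
    # predictors selected in every replication.
    import numpy as np
    from sklearn.linear_model import LassoCV

    def intersect_lasso_selection(X, y, n_replications=10, frac=0.9, seed=0):
        """Intersect Lasso supports across subsamples drawn without replacement."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        m = int(frac * n)  # each replication is slightly smaller than the full data
        selected = None
        for _ in range(n_replications):
            idx = rng.choice(n, size=m, replace=False)   # sample without replacement
            coef = LassoCV(cv=5).fit(X[idx], y[idx]).coef_
            support = set(np.flatnonzero(coef))          # indices of nonzero coefficients
            selected = support if selected is None else selected & support
        return sorted(selected)

    # Toy usage in a p >> n regime: 50 samples, 200 predictors, 3 true predictors.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 200))
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=50)
    print(intersect_lasso_selection(X, y))

The intersection step is what drives down false positives in this scheme: a spurious predictor that happens to fit the non-linear variation on one small subsample is unlikely to be selected on every other subsample, whereas a true predictor should survive all replications.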