Subdata Selection With a Large Number of Variables

The New England Journal of Statistics in Data Science. Pub Date: 2023-01-01. DOI: 10.51387/23-nejsds36

Rakhi Singh, J. Stufken


Citations: 2

Abstract

Subdata selection from big data is an active area of research that facilitates inferences based on big data with limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method is a computationally efficient method for selecting subdata that has excellent statistical properties. But the method can only be used if the subdata size, k, is at least twice the number of regression variables, p. In addition, even when $k\ge 2p$, under the assumption of effect sparsity, one can expect to obtain subdata with better statistical properties by focusing on the active variables. Inspired by recent efforts to extend the IBOSS method to situations with a large number of variables p, we introduce a method called Combining Lasso And Subdata Selection (CLASS) that, as shown, improves on other proposed methods in terms of variable selection and building a predictive model based on subdata when the full data size n is very large and the number of variables p is large. In terms of computational expense, CLASS is more expensive than recent competitors for moderately large values of n, but the roles reverse under effect sparsity for extremely large values of n.
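To make the $k\ge 2p$ requirement concrete, the following is a minimal Python sketch of the basic IBOSS selection rule as it is commonly described: for each variable in turn, take the r = k/(2p) rows with the smallest values and the r rows with the largest values among the rows not yet selected. It is an illustration under stated assumptions, not the authors' implementation, and it does not implement CLASS. The function name `iboss_select`, the use of NumPy, and the integer rounding of r are assumptions made here for illustration.

```python
import numpy as np

def iboss_select(X, k):
    """Sketch of IBOSS-style selection: return row indices of the subdata.

    X is an (n, p) array of covariates and k is the target subdata size.
    The name and signature are hypothetical, chosen for this example.
    """
    n, p = X.shape
    if k > n:
        raise ValueError("subdata size k cannot exceed the full data size n")
    r = k // (2 * p)  # rows taken from each tail of each variable
    if r < 1:
        # With k < 2p there is not even one row per tail per variable.
        raise ValueError("IBOSS needs k >= 2p so that r = k // (2p) >= 1")

    available = np.ones(n, dtype=bool)  # rows not yet selected
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        col = X[idx, j]
        # Partial sort: the r smallest values end up first, the r largest last.
        part = np.argpartition(col, [r - 1, len(col) - r])
        picked = np.concatenate([idx[part[:r]], idx[part[-r:]]])
        available[picked] = False
        chosen.append(picked)
    return np.concatenate(chosen)
```

With this rule the subdata has 2pr rows, which is why a subdata size below 2p leaves no rows to take per variable; it is also why, when p is large, restricting attention to a smaller set of active variables allows more extreme rows per variable for the same k.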

