A statistical view of column subset selection.

IF 3.6 | Zone 1 (Mathematics) | JCR Q1 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series B-Statistical Methodology | Pub Date: 2025-05-16 | DOI: 10.1093/jrsssb/qkaf023

Anav Sood, Trevor Hastie

PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12288642/pdf/

Citations: 0

Abstract

We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as column subset selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of principal variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum-likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
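As a rough illustration of point (1) in the abstract, CSS from summary statistics alone, the sketch below evaluates a common form of the CSS objective directly from a covariance matrix: the total residual variance left after linearly regressing every variable on the selected ones. The greedy forward search is a stand-in chosen for brevity and is not taken from the paper; the function names `css_objective` and `greedy_css` and the simulated data are assumptions made for this example.

```python
import numpy as np


def css_objective(sigma, subset):
    """Residual variance tr(Sigma - Sigma[:, S] Sigma[S, S]^{-1} Sigma[S, :])."""
    s = list(subset)
    if not s:
        return float(np.trace(sigma))
    cross = sigma[:, s]
    # Variance of all variables explained by a linear fit on the subset.
    explained = cross @ np.linalg.solve(sigma[np.ix_(s, s)], cross.T)
    return float(np.trace(sigma) - np.trace(explained))


def greedy_css(sigma, k, tol=1e-12):
    """Greedy forward selection (illustrative heuristic, not the paper's method)."""
    residual = np.array(sigma, dtype=float)  # working copy of the covariance
    p = residual.shape[0]
    available = np.ones(p, dtype=bool)
    selected = []
    for _ in range(min(k, p)):
        diag = np.diag(residual)
        candidates = available & (diag > tol)
        if not candidates.any():
            break  # remaining variables are already fully explained
        # Drop in the objective from adding column j:
        # sum_i residual[i, j]^2 / residual[j, j].
        gains = np.full(p, -np.inf)
        gains[candidates] = (residual[:, candidates] ** 2).sum(axis=0) / diag[candidates]
        j = int(np.argmax(gains))
        selected.append(j)
        available[j] = False
        # Partial out the chosen variable (Schur complement update).
        residual = residual - np.outer(residual[:, j], residual[j, :]) / diag[j]
    return selected


# Simulated data (assumption): only the covariance matrix is passed on.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # a nearly redundant column
X -= X.mean(axis=0)
sigma = X.T @ X / X.shape[0]

subset = greedy_css(sigma, 2)
print(subset, css_objective(sigma, subset))
```

Because both functions take only `sigma`, the raw data matrix is never needed once the covariance has been computed. A greedy search does not guarantee the best subset for k > 1; it is used here only to keep the example short.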

Source journal

Journal of the Royal Statistical Society Series B-Statistical Methodology (Mathematics - Statistics and Probability)

CiteScore

8.80

Self-citation rate

0.00%

Articles published

Review time

>12 weeks

Journal description: Series B (Statistical Methodology) aims to publish high quality papers on the methodological aspects of statistics and data science more broadly. The objective of papers should be to contribute to the understanding of statistical methodology and/or to develop and improve statistical methods; any mathematical theory should be directed towards these aims. The kinds of contribution considered include descriptions of new methods of collecting or analysing data, with the underlying theory, an indication of the scope of application and preferably a real example. Also considered are comparisons, critical evaluations and new applications of existing methods, contributions to probability theory which have a clear practical bearing (including the formulation and analysis of stochastic models), statistical computation or simulation where original methodology is involved and original contributions to the foundations of statistical science. Reviews of methodological techniques are also considered. A paper, even if correct and well presented, is likely to be rejected if it only presents straightforward special cases of previously published work, if it is of mathematical interest only, if it is too long in relation to the importance of the new material that it contains or if it is dominated by computations or simulations of a routine nature.