Sparse Modeling and the Lasso

B. Efron, T. Hastie
{"title":"稀疏建模和套索","authors":"B. Efron, T. Hastie","doi":"10.1017/CBO9781316576533.017","DOIUrl":null,"url":null,"abstract":"The amount of data we are faced with keeps growing. From around the late 1990s we started to see wide data sets, where the number of variables far exceeds the number of observations. This was largely due to our increasing ability to measure a large amount of information automatically. In genomics, for example, we can use a high-throughput experiment to automatically measure the expression of tens of thousands of genes in a sample in a short amount of time. Similarly, sequencing equipment allows us to genotype millions of SNPs (single-nucleotide polymorphisms) cheaply and quickly. In document retrieval and modeling, we represent a document by the presence or count of each word in the dictionary. This easily leads to a feature vector with 20,000 components, one for each distinct vocabulary word, although most would be zero for a small document. If we move to bi-grams or higher, the feature space gets really large. In even more modest situations, we can be faced with hundreds of variables. If these variables are to be predictors in a regression or logistic regression model, we probably do not want to use them all. It is likely that a subset will do the job well, and including all the redundant variables will degrade our fit. Hence we are often interested in identifying a good subset of variables. Note also that in these wide-data situations, even linear models are over-parametrized, so some form of reduction or regularization is essential. In this chapter we will discuss some of the popular methods for model selection, starting with the time-tested and worthy forward-stepwise approach. We then look at the lasso, a popular modern method that does selection and shrinkage via convex optimization. The LARs algorithm ties these two approaches together, and leads to methods that can deliver paths of solutions. Finally, we discuss some connections with other modern big-and widedata approaches, and mention some extensions. Forward Stepwise Regression Stepwise procedures have been around for a very long time. They were originally devised in times when data sets were quite modest in size, in particular in terms of the number of variables. Originally thought of as the poor cousins of “best-subset” selection, they had the advantage of being much cheaper to compute (and in fact possible to compute for large p).We will review best-subset regression first.","PeriodicalId":430973,"journal":{"name":"Computer Age Statistical Inference, Student Edition","volume":"111 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sparse Modeling and the Lasso\",\"authors\":\"B. Efron, T. Hastie\",\"doi\":\"10.1017/CBO9781316576533.017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The amount of data we are faced with keeps growing. From around the late 1990s we started to see wide data sets, where the number of variables far exceeds the number of observations. This was largely due to our increasing ability to measure a large amount of information automatically. In genomics, for example, we can use a high-throughput experiment to automatically measure the expression of tens of thousands of genes in a sample in a short amount of time. 
Similarly, sequencing equipment allows us to genotype millions of SNPs (single-nucleotide polymorphisms) cheaply and quickly. In document retrieval and modeling, we represent a document by the presence or count of each word in the dictionary. This easily leads to a feature vector with 20,000 components, one for each distinct vocabulary word, although most would be zero for a small document. If we move to bi-grams or higher, the feature space gets really large. In even more modest situations, we can be faced with hundreds of variables. If these variables are to be predictors in a regression or logistic regression model, we probably do not want to use them all. It is likely that a subset will do the job well, and including all the redundant variables will degrade our fit. Hence we are often interested in identifying a good subset of variables. Note also that in these wide-data situations, even linear models are over-parametrized, so some form of reduction or regularization is essential. In this chapter we will discuss some of the popular methods for model selection, starting with the time-tested and worthy forward-stepwise approach. We then look at the lasso, a popular modern method that does selection and shrinkage via convex optimization. The LARs algorithm ties these two approaches together, and leads to methods that can deliver paths of solutions. Finally, we discuss some connections with other modern big-and widedata approaches, and mention some extensions. Forward Stepwise Regression Stepwise procedures have been around for a very long time. They were originally devised in times when data sets were quite modest in size, in particular in terms of the number of variables. Originally thought of as the poor cousins of “best-subset” selection, they had the advantage of being much cheaper to compute (and in fact possible to compute for large p).We will review best-subset regression first.\",\"PeriodicalId\":430973,\"journal\":{\"name\":\"Computer Age Statistical Inference, Student Edition\",\"volume\":\"111 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Age Statistical Inference, Student Edition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1017/CBO9781316576533.017\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Age Statistical Inference, Student Edition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/CBO9781316576533.017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The amount of data we are faced with keeps growing. From around the late 1990s we started to see wide data sets, where the number of variables far exceeds the number of observations. This was largely due to our increasing ability to measure a large amount of information automatically. In genomics, for example, we can use a high-throughput experiment to automatically measure the expression of tens of thousands of genes in a sample in a short amount of time. Similarly, sequencing equipment allows us to genotype millions of SNPs (single-nucleotide polymorphisms) cheaply and quickly. In document retrieval and modeling, we represent a document by the presence or count of each word in the dictionary. This easily leads to a feature vector with 20,000 components, one for each distinct vocabulary word, although most would be zero for a small document. If we move to bi-grams or higher, the feature space gets really large.

In even more modest situations, we can be faced with hundreds of variables. If these variables are to be predictors in a regression or logistic regression model, we probably do not want to use them all. It is likely that a subset will do the job well, and including all the redundant variables will degrade our fit. Hence we are often interested in identifying a good subset of variables. Note also that in these wide-data situations, even linear models are over-parametrized, so some form of reduction or regularization is essential.

In this chapter we will discuss some of the popular methods for model selection, starting with the time-tested and worthy forward-stepwise approach. We then look at the lasso, a popular modern method that does selection and shrinkage via convex optimization. The LARS algorithm ties these two approaches together, and leads to methods that can deliver paths of solutions. Finally, we discuss some connections with other modern big- and wide-data approaches, and mention some extensions.

Forward Stepwise Regression

Stepwise procedures have been around for a very long time. They were originally devised in times when data sets were quite modest in size, in particular in terms of the number of variables. Originally thought of as the poor cousins of "best-subset" selection, they had the advantage of being much cheaper to compute (and in fact possible to compute for large p). We will review best-subset regression first.
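To make the greedy flavor of forward-stepwise selection concrete, here is a minimal sketch in Python/NumPy. It is not the book's code; the residual-sum-of-squares entry criterion and the synthetic data are illustrative assumptions only.

```python
import numpy as np

def forward_stepwise(X, y, max_terms):
    """Greedy forward selection: at each step, add the predictor that most
    reduces the residual sum of squares of a least-squares fit (a common,
    though not the only possible, entry criterion)."""
    n, p = X.shape
    selected = []                  # indices of chosen predictors, in entry order
    remaining = set(range(p))
    for _ in range(min(max_terms, p)):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])   # intercept + candidate subset
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Tiny synthetic example: only 3 of 20 predictors actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=100)
print(forward_stepwise(X, y, max_terms=3))   # expect indices 0, 1, 2 in some order
```

Best-subset selection, by contrast, searches over all subsets of each size, which is why it becomes infeasible for large p while the greedy procedure above stays cheap.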
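For reference, the convex optimization problem the lasso solves can be written in its usual Lagrangian form (a standard formulation; the chapter's own notation and scaling of the penalty may differ):

```latex
\hat{\beta}^{\mathrm{lasso}} \;=\; \operatorname*{arg\,min}_{\beta_0,\,\beta}\;
\frac{1}{2}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The l1 penalty both shrinks the coefficients and sets many of them exactly to zero, which is the sense in which the lasso does selection and shrinkage at once; varying the penalty parameter from large to small traces out the path of solutions referred to above.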
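And a short sketch of computing lasso solutions over a grid of penalty values, to show coefficients entering the model one by one along the path. scikit-learn and its lasso_path helper are assumed dependencies here, not something the chapter prescribes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Wide-ish synthetic data: 50 predictors, only 5 of them informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# Lasso solutions over a decreasing grid of penalties (alphas);
# coefs has shape (n_features, n_alphas).
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# As the penalty shrinks, more coefficients become nonzero: selection and
# shrinkage happening together along a path of solutions.
for a, c in zip(alphas[::10], coefs[:, ::10].T):
    print(f"alpha = {a:10.2f}   nonzero coefficients = {np.count_nonzero(c)}")
```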