Title: An Econometric Perspective on Algorithmic Subsampling
Authors: Sokbae Lee, Serena Ng
Journal: Annual Review of Economics
Publication date: 2020-08-02
DOI: https://doi.org/10.1146/annurev-economics-022720-114138
Abstract
Data sets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data, and diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset preserve the features of the original data? We review a line of work grounded in theoretical computer science and numerical linear algebra that finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as subspace embedding. Building on this work, we study how prediction and inference can be affected by data sketching within a linear regression setup. We use statistical arguments to provide “inference-conscious” guides to the sketch size and show that an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
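As a rough illustration of the ideas in the abstract, the following Python sketch draws uniform random row subsets of a simulated regression data set, checks how well one subset preserves the column space of the design matrix (the subspace-embedding property), and averages OLS estimates over several subsets. Everything here is an assumption made for illustration: the simulated data, the uniform sampling scheme, the simple average used for pooling, and the helper names `sketch_rows`, `pooled_sketch_ols`, and `embedding_distortion` are not taken from the paper, which may use different sketches and a different pooling rule.

```python
import numpy as np

def ols(X, y):
    # Least-squares coefficients via NumPy's built-in solver.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def sketch_rows(n, m, rng):
    # Hypothetical helper: uniform sampling of m row indices without
    # replacement, one simple kind of random sketch.
    return rng.choice(n, size=m, replace=False)

def pooled_sketch_ols(X, y, m, num_sketches, rng):
    # Assumed pooling rule: plain average of OLS estimates computed on
    # independently drawn row subsets.
    n = X.shape[0]
    estimates = []
    for _ in range(num_sketches):
        idx = sketch_rows(n, m, rng)
        estimates.append(ols(X[idx], y[idx]))
    return np.mean(estimates, axis=0)

def embedding_distortion(X, idx):
    # With U an orthonormal basis for the column space of X, a good sketch
    # keeps the squared singular values of sqrt(n/m) * U[idx] close to 1.
    n, m = X.shape[0], len(idx)
    U, _ = np.linalg.qr(X)
    s = np.linalg.svd(np.sqrt(n / m) * U[idx], compute_uv=False)
    return np.max(np.abs(s**2 - 1.0))

# Simulated linear regression data (illustrative values only).
rng = np.random.default_rng(0)
n, d, m = 20000, 3, 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

idx0 = sketch_rows(n, m, rng)
beta_full = ols(X, y)                      # full-sample benchmark
beta_one = ols(X[idx0], y[idx0])           # a single sketch
beta_pooled = pooled_sketch_ols(X, y, m, 10, rng)

print("distortion of one sketch:", embedding_distortion(X, idx0))
print("single sketch error:", np.abs(beta_one - beta_full).max())
print("pooled sketch error:", np.abs(beta_pooled - beta_full).max())
```

Comparing the last two printed values on a given run gives a feel for how pooling over sketches can move the estimate toward the full-sample benchmark, though the exact numbers depend on the seed and the assumed design.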
About the journal:
The Annual Review of Economics covers significant developments in the field of economics, including macroeconomics and money; microeconomics, including economic psychology; international economics; public finance; health economics; education; economic growth and technological change; economic development; social economics, including culture, institutions, social interaction, and networks; game theory, political economy, and social choice; and more.