An Econometric Perspective on Algorithmic Subsampling
Authors: Sokbae Lee, Serena Ng
Journal: Annual Review of Economics, pp. 45-80
Published: 2020-08-02
DOI: 10.1146/annurev-economics-022720-114138 (https://doi.org/10.1146/annurev-economics-022720-114138)
Data sets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data, and diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset preserve the features of the original data? We review a line of work grounded in theoretical computer science and numerical linear algebra that finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as subspace embedding. Building on this work, we study how prediction and inference can be affected by data sketching within a linear regression setup. We use statistical arguments to provide “inference-conscious” guides to the sketch size and show that an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
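As a rough illustration of the ideas in the abstract, the following minimal sketch runs least squares on random subsets of rows, checks how well one subset preserves the geometry of the full design matrix, and averages the estimates from several subsets. It makes several assumptions that are not taken from the article: rows are sampled uniformly without replacement, "pooling" is a plain average of the per-sketch estimates, the data are simulated, and all function names are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def draw_rows(n, m, rng):
    """Indices of m rows sampled uniformly without replacement (assumed scheme)."""
    return rng.choice(n, size=m, replace=False)


def embedding_distortion(X, rows):
    """Largest relative error eps such that, for every vector a,
    (1 - eps) * ||X a||^2 <= (n/m) * ||X[rows] a||^2 <= (1 + eps) * ||X a||^2.

    Computed from the singular values of the rescaled sampled rows of Q,
    where X = Q R is a thin QR factorization.
    """
    n, m = X.shape[0], len(rows)
    Q, _ = np.linalg.qr(X)
    sv = np.linalg.svd(np.sqrt(n / m) * Q[rows], compute_uv=False)
    return float(np.max(np.abs(sv**2 - 1.0)))


def pooled_sketched_ols(X, y, m, n_sketches, rng):
    """Plain average of OLS estimates over independently drawn sketches."""
    estimates = []
    for _ in range(n_sketches):
        rows = draw_rows(X.shape[0], m, rng)
        estimates.append(ols(X[rows], y[rows]))
    return np.mean(estimates, axis=0)


# Simulated data (illustrative only): n rows, d regressors, sketch size m.
rng = np.random.default_rng(0)
n, d, m = 50_000, 5, 1_000
X = rng.standard_normal((n, d))
beta = np.arange(1.0, d + 1.0)
y = X @ beta + rng.standard_normal(n)

rows = draw_rows(n, m, rng)
b_full = ols(X, y)                                        # full-sample benchmark
b_one = ols(X[rows], y[rows])                             # a single sketch
b_pool = pooled_sketched_ols(X, y, m, n_sketches=20, rng=rng)
eps = embedding_distortion(X, rows)

print("embedding distortion of one sketch:", eps)
print("error, full sample  :", np.linalg.norm(b_full - beta))
print("error, one sketch   :", np.linalg.norm(b_one - beta))
print("error, pooled (20x) :", np.linalg.norm(b_pool - beta))
```

A small distortion value means the sketch approximately preserves the norm of every vector in the column space of the design matrix, which is the subspace embedding property the abstract refers to. Uniform sampling is only the simplest choice here; other sketching schemes exist and may behave differently on less regular data.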
About the journal:
The Annual Review of Economics covers significant developments in the field of economics, including macroeconomics and money; microeconomics, including economic psychology; international economics; public finance; health economics; education; economic growth and technological change; economic development; social economics, including culture, institutions, social interaction, and networks; game theory, political economy, and social choice; and more.