An Econometric Perspective on Algorithmic Subsampling

IF 11.4, Zone 2 (Economics), Q1 ECONOMICS

Annual Review of Economics, Pub Date: 2020-08-02, DOI: 10.1146/annurev-economics-022720-114138

Sokbae Lee, Serena Ng



Abstract


An Econometric Perspective on Algorithmic Subsampling

Data sets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data, and diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset preserve the features of the original data? We review a line of work grounded in theoretical computer science and numerical linear algebra that finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as subspace embedding. Building on this work, we study how prediction and inference can be affected by data sketching within a linear regression setup. We use statistical arguments to provide “inference-conscious” guides to the sketch size and show that an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
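As a rough illustration of the idea in the abstract, the following sketch fits a simple linear regression on the full sample, on one randomly chosen subset of rows, and on an average over several such subsets. It is a minimal toy under stated assumptions, not the authors' procedure: the simulated data, the uniform sampling without replacement, the simple averaging rule, and the helper names `ols`, `uniform_sketch`, and `pooled_slope` are all hypothetical choices made for this example.

```python
import random

def ols(rows):
    """Intercept and slope of y on x by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def uniform_sketch(rows, m, rng):
    """Assumed sketching scheme: keep m rows drawn uniformly without replacement."""
    return rng.sample(rows, m)

def pooled_slope(rows, m, k, rng):
    """Assumed pooling rule: average the slope estimates from k sketches of size m."""
    return sum(ols(uniform_sketch(rows, m, rng))[1] for _ in range(k)) / k

# Simulated data (an assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(0)
n, true_slope = 20_000, 2.0
data = []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 1.0 + true_slope * x + rng.gauss(0.0, 1.0)))

full_slope = ols(data)[1]                             # uses all n rows
single_slope = ols(uniform_sketch(data, 500, rng))[1]  # one sketch of 500 rows
pooled = pooled_slope(data, 500, 20, rng)              # average over 20 sketches
print(f"full={full_slope:.4f} single={single_slope:.4f} pooled={pooled:.4f}")
```

In this toy setting a single small sketch gives a noisier slope than the full sample, while averaging over several sketches recovers much of the lost precision. The sketching schemes and the pooled estimator studied in the article may differ from the uniform sampling and plain averaging assumed here.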


Source journal

Annual Review of Economics (ECONOMICS)

CiteScore: 9.70

Self-citation rate: 3.60%

Articles published

About the journal: The Annual Review of Economics covers significant developments in the field of economics, including macroeconomics and money; microeconomics, including economic psychology; international economics; public finance; health economics; education; economic growth and technological change; economic development; social economics, including culture, institutions, social interaction, and networks; game theory, political economy, and social choice; and more.