An Econometric Perspective on Algorithmic Subsampling

IF 6.8, Zone 2 (Economics), JCR Q1 (ECONOMICS)
Sokbae Lee, Serena Ng
Annual Review of Economics, published 2020-08-02
DOI: 10.1146/annurev-economics-022720-114138
https://doi.org/10.1146/annurev-economics-022720-114138

Abstract

Data sets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data, and diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset preserve the features of the original data? We review a line of work grounded in theoretical computer science and numerical linear algebra that finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as subspace embedding. Building on this work, we study how prediction and inference can be affected by data sketching within a linear regression setup. We use statistical arguments to provide “inference-conscious” guides to the sketch size and show that an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
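
The ideas in the abstract can be illustrated with a minimal sketch in Python. It rests on several assumptions that do not come from the article: the sketch is a uniform random sample of rows drawn without replacement, the pooled estimator is a plain average of OLS estimates over disjoint sketches (the article's pooling scheme may differ), and the function names `uniform_sketch`, `pooled_sketched_ols`, and `embedding_distortion` are hypothetical. The distortion check measures how far the squared singular values of the rescaled, subsampled orthonormal basis deviate from 1, which is one common way to express the subspace embedding property.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients computed with numpy's lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def uniform_sketch(n, m, rng):
    """Indices of m rows drawn uniformly without replacement from n rows."""
    return rng.choice(n, size=m, replace=False)


def sketched_ols(X, y, m, rng):
    """OLS on a single uniform sketch of m rows."""
    idx = uniform_sketch(X.shape[0], m, rng)
    return ols(X[idx], y[idx])


def pooled_sketched_ols(X, y, m, k, rng):
    """Average of OLS estimates over k disjoint sketches of m rows each.

    Simple averaging is an illustrative assumption, not necessarily
    the pooling rule used in the article.
    """
    n = X.shape[0]
    if k * m > n:
        raise ValueError("k * m must not exceed the number of rows")
    blocks = rng.permutation(n)[: k * m].reshape(k, m)
    return np.mean([ols(X[b], y[b]) for b in blocks], axis=0)


def embedding_distortion(X, idx):
    """Largest deviation from 1 of the squared singular values of S U.

    U is an orthonormal basis for the column space of X, and S selects
    the rows in idx and rescales them by sqrt(n / m). A value below
    epsilon means the sketch is an epsilon-subspace embedding for X.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    n, m = X.shape[0], len(idx)
    s = np.linalg.svd(np.sqrt(n / m) * U[idx], compute_uv=False)
    return float(np.max(np.abs(s**2 - 1.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 20_000, 5
    X = rng.standard_normal((n, d))            # simulated regressors
    beta_true = np.arange(1.0, d + 1.0)        # hypothetical coefficients
    y = X @ beta_true + rng.standard_normal(n)

    full = ols(X, y)
    single = sketched_ols(X, y, m=500, rng=rng)
    pooled = pooled_sketched_ols(X, y, m=500, k=10, rng=rng)

    print("error, one sketch:", np.linalg.norm(single - full))
    print("error, pooled    :", np.linalg.norm(pooled - full))
    print("distortion       :", embedding_distortion(X, uniform_sketch(n, 500, rng)))
```

In this simulated setting the rows are exchangeable, so uniform sampling is a reasonable sketch; with highly uneven leverage across rows, a uniform sample may fail to preserve the column space, and a different sampling or projection scheme would be needed.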
Source journal: Annual Review of Economics
CiteScore: 9.70
Self-citation rate: 3.60%
Articles published: 34
Journal description: The Annual Review of Economics covers significant developments in the field of economics, including macroeconomics and money; microeconomics, including economic psychology; international economics; public finance; health economics; education; economic growth and technological change; economic development; social economics, including culture, institutions, social interaction, and networks; game theory, political economy, and social choice; and more.