Estimating the Unseen

Journal of the ACM (JACM) | Pub Date: 2017-10-04 | DOI: 10.1145/3125643

Paul Valiant, G. Valiant


Citations: 37

Abstract

We show that a class of statistical properties of distributions, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear-sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most k distinct elements, these properties can be estimated accurately using a sample of size O(k / log k). For these estimation tasks, this performance is optimal, to constant factors. Complementing these theoretical results, we also demonstrate that our estimators perform exceptionally well, in practice, for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. The key step in our approach is to first use the sample to characterize the “unseen” portion of the distribution, effectively reconstructing this portion of the distribution as accurately as if one had a sample larger by a logarithmic factor. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This work can be seen as introducing a robust, general, and theoretically principled framework that, for many practical applications, essentially amplifies the sample size by a logarithmic factor; we expect that it may be fruitfully used as a component within larger machine learning and statistical analysis systems.
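
For orientation, the following is a minimal Python sketch of the two baselines the abstract contrasts itself with; it is not the estimator introduced in the paper. It computes the classical Good-Turing estimate of the unseen probability mass (the fraction of draws that are elements observed exactly once) and the naive plug-in entropy of the empirical distribution. The function names and the toy sample are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
import math


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the total probability of unseen elements.

    Classical formula: (number of elements observed exactly once) / n.
    This estimates only the total unseen mass, not its shape.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def plugin_entropy(sample):
    """Naive plug-in entropy (in nats) of the empirical distribution.

    This ignores unseen elements entirely, so it tends to underestimate
    entropy when the sample is small relative to the support size.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical toy sample: "a" seen 3 times, "b" twice, "c", "d", "e" once each.
sample = ["a", "b", "a", "c", "d", "a", "b", "e"]
print(good_turing_missing_mass(sample))  # 3 singletons out of 8 draws
print(plugin_entropy(sample))
```

Both quantities depend only on how many elements were seen once, twice, and so on. The approach described in the abstract goes further by estimating the shape of the unseen portion, which this sketch does not attempt.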


