Near-optimal estimation of the unseen under regularly varying tail populations

IF 1.7, Zone 2 (Mathematics), Q2 STATISTICS & PROBABILITY

Bernoulli, Pub Date: 2021-04-07, DOI: 10.3150/23-bej1589

S. Favaro, Zacharie Naulet


Citations: 8

Abstract

Given $n$ samples from a population of individuals belonging to different species, what is the number $U$ of hitherto unseen species that would be observed if $\lambda n$ new samples were collected? This is an important problem in many scientific endeavors, and it has been the subject of recent works introducing non-parametric estimators of $U$ that are minimax near-optimal and consistent all the way up to $\lambda \asymp \log n$. These works do not rely on any assumption on the underlying unknown distribution $p$ of the population; therefore, while they provide a theory in its greatest generality, worst-case distributions may severely hamper the estimation of $U$ in concrete applications. In this paper, we consider the problem of strengthening the non-parametric framework for estimating $U$. Inspired by the estimation of rare probabilities in extreme value theory, and motivated by the ubiquitous power-law type distributions in many natural and social phenomena, we make use of a semi-parametric assumption of regular variation of index $\alpha \in (0,1)$ for the tail behaviour of $p$. Under this assumption, we introduce an estimator of $U$ that is simple, linear in the sampling information, computationally efficient, and scalable to massive datasets. Then, uniformly over our class of regularly varying tail distributions, we show that the proposed estimator has provable guarantees: i) it is minimax near-optimal, up to a factor that is a power of $\log n$; ii) it is consistent all the way up to $\log\lambda \asymp n^{\alpha/2}/\sqrt{\log n}$, and this range is the best possible. This work presents the first study on the estimation of the unseen under regularly varying tail distributions. A numerical illustration of our methodology is presented for synthetic data and real data.
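To make the target quantity concrete, the following is a minimal simulation sketch of $U$ itself, not of the estimator proposed in the paper. It draws $n$ samples, then $\lambda n$ further samples, and counts the species that appear only in the second batch. The function name `simulate_unseen`, the truncated support size, and the choice of a Zipf-type law $p_j \propto j^{-1/\alpha}$ are all illustrative assumptions; this law is used only as one convenient example of a power-law tail of index $\alpha$, and the truncation makes it an approximation.

```python
import numpy as np


def simulate_unseen(n, lam, alpha, support=100_000, seed=0):
    """Simulate the number U of new species seen in lam * n extra samples.

    Assumption (not taken from the paper): the population follows a
    truncated Zipf-type law p_j proportional to j ** (-1 / alpha), used
    here only as an example of a power-law tail of index alpha.
    Returns (distinct species in the first n samples, U).
    """
    rng = np.random.default_rng(seed)

    # Species probabilities: power law over a finite (truncated) support.
    ranks = np.arange(1, support + 1, dtype=float)
    p = ranks ** (-1.0 / alpha)
    p /= p.sum()

    # Initial sample of size n, then an additional sample of size lam * n.
    first = rng.choice(support, size=n, p=p)
    m = int(round(lam * n))
    second = rng.choice(support, size=m, p=p)

    # U = species present in the additional sample but absent from the first.
    seen = np.unique(first)
    new = np.setdiff1d(np.unique(second), seen)
    return int(seen.size), int(new.size)


if __name__ == "__main__":
    seen, unseen = simulate_unseen(n=1000, lam=2.0, alpha=0.5, support=10_000)
    print(f"distinct species in first sample: {seen}")
    print(f"newly observed species U: {unseen}")
```

In practice $U$ is not observable, since the additional $\lambda n$ samples have not been collected; a simulation of this kind is only useful as ground truth against which an estimator built from the first $n$ samples can be compared.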


Source journal

Bernoulli (Mathematics: Statistics & Probability)

CiteScore: 3.40

Self-citation rate: 0.00%

Articles published: 116

Review time: 6-12 weeks

About the journal: BERNOULLI is the journal of the Bernoulli Society for Mathematical Statistics and Probability, issued four times per year. The journal provides a comprehensive account of important developments in the fields of statistics and probability, offering an international forum for both theoretical and applied work. BERNOULLI will publish papers containing original and significant research contributions, with background, mathematical derivation and discussion of the results in suitable detail and, where appropriate, with discussion of interesting applications in relation to the methodology proposed. Papers of the following two types will also be considered for publication, provided they are judged to enhance the dissemination of research: review papers which provide an integrated critical survey of some area of probability and statistics and discuss important recent developments; and scholarly written papers on some historically significant aspect of statistics and probability.