Fadhel Ayed, M. Battiston, F. Camerlenghi, S. Favaro
{"title":"On consistent and rate optimal estimation of the missing mass","authors":"Fadhel Ayed, M. Battiston, F. Camerlenghi, S. Favaro","doi":"10.1214/20-AIHP1126","DOIUrl":null,"url":null,"abstract":". Given n samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the ( n + 1)-th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. Recent results have shown: i) the impossibility of estimating the missing mass without imposing further assumptions on type’s proportions; ii) the consistency of the Good-Turing estimator of the missing mass under the assumption that the tail of type’s proportions decays to zero as a regularly varying function with parameter α ∈ (0 , 1); ii) the rate of convergence n − α/ 2 for the Good-Turing estimator under the class of α ∈ (0 , 1) regularly varying P . In this paper we introduce an alternative, and remarkably shorter, proof of the impossibility of a distribution-free estimation of the missing mass. Beside being of independent interest, our alternative proof suggests a natural approach to strengthen, and expand, the recent results on the rate of convergence of the Good-Turing estimator under α ∈ (0 , 1) regularly varying type’s proportions. In particular, we show that the convergence rate n − α/ 2 is the best rate that any estimator can achieve, up to a slowly varying function. Furthermore, we prove that a lower bound to the minimax estimation risk must scale at least as n − α/ 2 , which leads to conjecture that the Good-Turing estimator is a rate optimal minimax estimator under regularly varying type proportions.","PeriodicalId":42884,"journal":{"name":"Annales de l Institut Henri Poincare D","volume":null,"pages":null},"PeriodicalIF":1.5000,"publicationDate":"2021-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annales de l Institut Henri Poincare D","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1214/20-AIHP1126","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PHYSICS, MATHEMATICAL","Score":null,"Total":0}
Abstract
Given n samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the (n + 1)-th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. Recent results have shown: i) the impossibility of estimating the missing mass without imposing further assumptions on the type proportions; ii) the consistency of the Good-Turing estimator of the missing mass under the assumption that the tail of the type proportions decays to zero as a regularly varying function with parameter α ∈ (0, 1); iii) the rate of convergence n^{−α/2} for the Good-Turing estimator under the class of α ∈ (0, 1) regularly varying distributions P. In this paper we introduce an alternative, and remarkably shorter, proof of the impossibility of a distribution-free estimation of the missing mass. Besides being of independent interest, our alternative proof suggests a natural approach to strengthen, and expand, the recent results on the rate of convergence of the Good-Turing estimator under α ∈ (0, 1) regularly varying type proportions. In particular, we show that the convergence rate n^{−α/2} is the best rate that any estimator can achieve, up to a slowly varying function. Furthermore, we prove that a lower bound on the minimax estimation risk must scale at least as n^{−α/2}, which leads us to conjecture that the Good-Turing estimator is a rate optimal minimax estimator under regularly varying type proportions.
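For readers unfamiliar with the Good-Turing estimator referenced above, the following minimal Python sketch (not taken from the paper) computes it as the proportion of singleton types in the sample, and applies it to draws from a Zipf-type population, whose type proportions have a regularly varying tail; the function name, the Zipf exponent and the sample size are illustrative assumptions, not choices made by the authors.

```python
import numpy as np

def good_turing_missing_mass(sample):
    """Good-Turing estimator of the missing mass: the fraction of
    observations whose type appears exactly once in the sample."""
    _, counts = np.unique(sample, return_counts=True)
    n1 = np.sum(counts == 1)  # number of singleton types
    return n1 / len(sample)

# Illustration on a Zipf-type population: proportions p_k ∝ k^{-1.5},
# so the tail of type proportions is regularly varying with
# α = 1/1.5 ≈ 0.67 ∈ (0, 1), the regime considered in the paper.
rng = np.random.default_rng(0)
n = 10_000
sample = rng.zipf(a=1.5, size=n)  # types labelled by positive integers
print(good_turing_missing_mass(sample))
```

Heavier tails (smaller α) leave more probability mass on undiscovered types at a given sample size, which is consistent with the n^{−α/2} rate discussed in the abstract: the smaller α is, the slower the estimation error can shrink.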