{"title":"Improved Tail Bounds for Missing Mass and Confidence Intervals for Good-Turing Estimator","authors":"Prafulla Chandra, Aditya Pradeep, A. Thangaraj","doi":"10.1109/NCC.2019.8732184","DOIUrl":null,"url":null,"abstract":"The missing mass of a sequence is defined as the total probability of the elements that have not appeared or occurred in the sequence. The popular Good-Turing estimator for missing mass has been used extensively in language modeling and ecological studies. Exponential tail bounds have been known for missing mass, and improving them results in better confidence in estimation. In this work, we first show that missing mass is sub-Gamma on the right tail with the best-possible variance parameter under the Poisson and multinomial sampling models. This results in a right tail bound that beats the previously best known tail bound for deviation from mean up to about 0.2785. Further, we show that the sub-Gaussian approach cannot result in any improvement in the right tail bound for Poisson sampling. We derive confidence intervals for the Good-Turing estimator with better confidence levels and narrower width when compared to existing ones. Our results are worst case over all distributions.","PeriodicalId":6870,"journal":{"name":"2019 National Conference on Communications (NCC)","volume":"19 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC.2019.8732184","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
The missing mass of a sequence is defined as the total probability of the elements that have not appeared in the sequence. The popular Good-Turing estimator for missing mass has been used extensively in language modeling and ecological studies. Exponential tail bounds are known for the missing mass, and improving them yields tighter confidence in estimation. In this work, we first show that the missing mass is sub-Gamma on the right tail with the best possible variance parameter under both the Poisson and multinomial sampling models. This yields a right tail bound that improves on the previously best known bound for deviations from the mean up to about 0.2785. Further, we show that the sub-Gaussian approach cannot yield any improvement in the right tail bound under Poisson sampling. We derive confidence intervals for the Good-Turing estimator with higher confidence levels and narrower widths than existing ones. Our results hold in the worst case over all distributions.
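As a concrete illustration (not taken from the paper itself), the Good-Turing estimate of the missing mass is the fraction of the sample contributed by symbols seen exactly once, i.e. M̂₀ = N₁/n, where N₁ is the number of singletons and n is the sample length. A minimal Python sketch of this estimator, with the example sequence chosen purely for illustration:

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass: the fraction of the
    sample made up of symbols that appear exactly once (singletons)."""
    counts = Counter(sample)
    n = len(sample)
    n1 = sum(1 for c in counts.values() if c == 1)  # number of singletons
    return n1 / n

# Example: 'a' and 'b' repeat, while 'c' and 'd' are singletons,
# so the estimated missing mass is 2/6.
print(good_turing_missing_mass(list("aabbcd")))  # 0.333...
```

The paper's contribution is a sharper exponential concentration bound around this estimate; the specific confidence-interval formulas are given in the full text and are not reproduced in this sketch.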