Improved Tail Bounds for Missing Mass and Confidence Intervals for Good-Turing Estimator

Prafulla Chandra, Aditya Pradeep, A. Thangaraj
2019 National Conference on Communications (NCC), pp. 1-6. Published 2019-02-01. DOI: https://doi.org/10.1109/NCC.2019.8732184
Citations: 6

Abstract

The missing mass of a sequence is the total probability of the elements that have not appeared in the sequence. The popular Good-Turing estimator for missing mass has been used extensively in language modeling and ecological studies. Exponential tail bounds are known for missing mass, and improving them yields better confidence in estimation. In this work, we first show that missing mass is sub-Gamma on the right tail with the best-possible variance parameter under the Poisson and multinomial sampling models. This results in a right tail bound that improves on the previously best-known tail bound for deviations from the mean of up to about 0.2785. Further, we show that the sub-Gaussian approach cannot yield any improvement in the right tail bound for Poisson sampling. We derive confidence intervals for the Good-Turing estimator with better confidence levels and narrower widths than existing ones. Our results are worst case over all distributions.
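To make the two quantities in the abstract concrete, the following minimal sketch computes the true missing mass of a sample (which requires knowing the underlying distribution) and the standard Good-Turing estimate of it (the number of symbols seen exactly once, divided by the sample size). The function names, the toy distribution `probs`, and the toy `sample` are illustrative assumptions and do not come from the paper; the sketch does not implement the paper's tail bounds or confidence intervals.

```python
from collections import Counter


def missing_mass(sample, probs):
    """True missing mass: total probability of symbols absent from the sample.

    `probs` maps each symbol to its probability (hypothetical toy input).
    """
    seen = set(sample)
    return sum(p for symbol, p in probs.items() if symbol not in seen)


def good_turing_missing_mass(sample):
    """Good-Turing estimate: (number of symbols seen exactly once) / n."""
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Toy example (assumed values, for illustration only).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
sample = ["a", "b", "a", "a", "c", "b", "a", "a"]

print(missing_mass(sample, probs))        # only "d" is unseen -> 0.125
print(good_turing_missing_mass(sample))   # only "c" appears once -> 1/8 = 0.125
```

In this toy case the estimate happens to equal the true missing mass; in general the two differ, and tail bounds of the kind studied in the paper quantify how far the missing mass can deviate from its mean.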