Bit-array-based alternatives to HyperLogLog

Svante Janson, Jérémie Lumbroso, Robert Sedgewick
Theoretical Computer Science, volume 1054, Article 115450. Published 2025-07-09.
DOI: 10.1016/j.tcs.2025.115450
https://www.sciencedirect.com/science/article/pii/S0304397525003883
Impact factor: 1.0. Region 4 (Computer Science). JCR Q3 (Computer Science, Theory & Methods).
Citations: 0

Abstract

We present a family of algorithms for the problem of estimating the number of distinct items in an input stream that are simple to implement and are appropriate for practical applications. Our algorithms are a logical extension of the series of algorithms developed by Flajolet and his coauthors starting in 1983 that culminated in the widely used HyperLogLog algorithm. These algorithms divide the input stream into M substreams and lead to a time-accuracy tradeoff where a small number of bits per substream are saved to achieve a relative accuracy proportional to 1/√M. Our algorithms use just one or two bits per substream. Their effectiveness is demonstrated by a proof of approximate normality, with explicit expressions for standard errors that inform parameter settings and allow proper quantitative comparisons with other methods. Performance hypotheses are validated through experiments using a realistic input stream, with the general conclusion that our algorithms are significantly more accurate than HyperLogLog when using the same amount of memory, and they use significantly less memory than HyperLogLog to achieve a given accuracy.
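
The abstract does not spell out the algorithms themselves, so the sketch below is not the authors' method. It shows classical linear counting, an older bitmap estimator, only to illustrate how a single bit per substream can already yield a distinct-count estimate. The function names, the choice of BLAKE2b as the hash, and the parameter m are assumptions made for this illustration.

```python
import hashlib
import math


def substream_index(item: str, m: int) -> int:
    """Map an item to one of m substreams (hypothetical helper, BLAKE2b assumed)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % m


def linear_count(stream, m: int) -> float:
    """Estimate the number of distinct items using one flag per substream.

    Classical linear counting, NOT the algorithm from the paper:
    n is estimated as -m * ln(fraction of flags still zero).
    """
    # One byte per flag for readability; a real sketch would pack these into bits.
    flags = bytearray(m)
    for item in stream:
        flags[substream_index(item, m)] = 1
    zeros = m - sum(flags)
    if zeros == 0:
        raise ValueError("bit array saturated; increase m")
    return -m * math.log(zeros / m)


if __name__ == "__main__":
    # 5,000 distinct items, each appearing three times.
    stream = [f"item-{i % 5000}" for i in range(15000)]
    print(round(linear_count(stream, m=16384)))
```

Repeated items set the same flag again, so duplicates never change the estimate. This simple bitmap saturates once the number of distinct items grows well beyond m, so m has to scale with the expected cardinality.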
Source journal
Theoretical Computer Science (Engineering & Technology: Computer Science, Theory & Methods)
CiteScore: 2.60
Self-citation rate: 18.20%
Articles published: 471
Review time: 12.6 months
About the journal: Theoretical Computer Science is mathematical and abstract in spirit, but it derives its motivation from practical and everyday computation. Its aim is to understand the nature of computation and, as a consequence of this understanding, provide more efficient methodologies. All papers introducing or studying mathematical, logic and formal concepts and methods are welcome, provided that their motivation is clearly drawn from the field of computing.