Bit-array-based alternatives to HyperLogLog

Impact factor: 1.0 · CAS Zone 4 (Computer Science) · JCR Q3 (Computer Science, Theory & Methods)

Theoretical Computer Science, vol. 1054, Article 115450 · Pub Date: 2025-07-09 · DOI: 10.1016/j.tcs.2025.115450

Svante Janson, Jérémie Lumbroso, Robert Sedgewick


Citations: 0



Bit-array-based alternatives to HyperLogLog

We present a family of algorithms for the problem of estimating the number of distinct items in an input stream that are simple to implement and are appropriate for practical applications. Our algorithms are a logical extension of the series of algorithms developed by Flajolet and his coauthors starting in 1983 that culminated in the widely used HyperLogLog algorithm. These algorithms divide the input stream into M substreams and lead to a time-accuracy tradeoff where a small number of bits per substream are saved to achieve a relative accuracy proportional to $1/\sqrt{M}$. Our algorithms use just one or two bits per substream. Their effectiveness is demonstrated by a proof of approximate normality, with explicit expressions for standard errors that inform parameter settings and allow proper quantitative comparisons with other methods. Performance hypotheses are validated through experiments using a realistic input stream, with the general conclusion that our algorithms are significantly more accurate than HyperLogLog when using the same amount of memory, and they use significantly less memory than HyperLogLog to achieve a given accuracy.
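To make the "one bit per substream" idea concrete, here is a minimal sketch using classical linear counting: each item is hashed to one of m positions in a bit array, and the count is estimated from the fraction of positions still zero. This is a well-known stand-in chosen for illustration, not the estimators analyzed in the paper, whose details the abstract does not give. The function name `linear_count`, the parameter `m`, and the choice of BLAKE2b as the hash are all assumptions.

```python
import hashlib
import math


def linear_count(items, m=4096):
    """Estimate the number of distinct items using an m-position bit array.

    Illustrative linear-counting sketch (an assumption, not the paper's method).
    """
    # One byte per position for readability; a real sketch would pack bits.
    bits = bytearray(m)
    for item in items:
        # Hash the item and use the hash to pick one of the m substreams.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        bits[int.from_bytes(digest, "big") % m] = 1
    zeros = m - sum(bits)
    if zeros == 0:
        # Every position is set: the array is saturated and m is too small.
        return float("inf")
    # Linear-counting estimate: n is approximately m * ln(m / zeros).
    return m * math.log(m / zeros)


if __name__ == "__main__":
    # 2000 distinct values, each seen three times; duplicates do not change the array.
    stream = list(range(2000)) * 3
    print(round(linear_count(stream)))
```

Because setting a bit is idempotent, repeated items leave the array unchanged, which is why the estimate depends only on the distinct items. Linear counting saturates once the number of distinct items greatly exceeds m, so this sketch only illustrates the bit-array representation, not the range or accuracy claimed in the abstract.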


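Source journal: Theoretical Computer Science (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 2.60

Self-citation rate: 18.20%

Articles published: 471

Review time: 12.6 months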

Journal description: Theoretical Computer Science is mathematical and abstract in spirit, but it derives its motivation from practical and everyday computation. Its aim is to understand the nature of computation and, as a consequence of this understanding, provide more efficient methodologies. All papers introducing or studying mathematical, logic and formal concepts and methods are welcome, provided that their motivation is clearly drawn from the field of computing.