Shikha Singh, P. Pandey, M. A. Bender, Jonathan W. Berry, Martín Farach-Colton, Rob Johnson, Thomas M. Kroeger, C. Phillips
Title: Timely Reporting of Heavy Hitters Using External Memory
DOI: 10.1145/3472392 (https://doi.org/10.1145/3472392)
Journal: ACM Transactions on Database Systems (TODS), volume 7, issue 1, pages 1–35
Publication date: 2021-11-15 (Journal Article)
Citations: 0
Abstract
Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus, in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
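To make the TED problem concrete, the following is a minimal in-RAM sketch (not the paper's external-memory data structure) of an exact, timely detector: it counts every item and reports an item the instant its T = ⌈ɸN⌉-th occurrence arrives, so there are no false positives, no false negatives, and zero reporting delay. The function name and structure are illustrative; the worst-case Ω(N)-word state held by `counts` is precisely what the paper's structures move to external memory.

```python
import math

def timely_heavy_hitters(stream, phi, n):
    """Exact, timely phi-heavy-hitter detection (illustrative sketch).

    Reports each item at the moment its T = ceil(phi * n)-th occurrence
    is seen.  Exactness requires Omega(n) words of counter state in the
    worst case, which is why scalable in-RAM algorithms approximate and
    why the paper instead moves this state to external memory.
    """
    threshold = math.ceil(phi * n)
    counts = {}
    reported = []
    for item in stream:
        counts[item] = counts.get(item, 0) + 1
        if counts[item] == threshold:   # exactly the T-th occurrence
            reported.append(item)       # timely: zero reporting delay
    return reported
```

For example, with the stream `['a', 'b', 'a', 'c', 'a', 'b']` (N = 6) and ɸ = 0.5, the threshold is T = 3, and `'a'` is reported immediately at its third occurrence.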