{"title":"Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams","authors":"Matthew Andres Moreno, Luis Zaman, Emily Dolson","doi":"arxiv-2409.06199","DOIUrl":null,"url":null,"abstract":"Operations over data streams typically hinge on efficient mechanisms to\naggregate or summarize history on a rolling basis. For high-volume data steams,\nit is critical to manage state in a manner that is fast and memory efficient --\nparticularly in resource-constrained or real-time contexts. Here, we address\nthe problem of extracting a fixed-capacity, rolling subsample from a data\nstream. Specifically, we explore ``data stream curation'' strategies to fulfill\nrequirements on the composition of sample time points retained. Our ``DStream''\nsuite of algorithms targets three temporal coverage criteria: (1) steady\ncoverage, where retained samples should spread evenly across elapsed data\nstream history; (2) stretched coverage, where early data items should be\nproportionally favored; and (3) tilted coverage, where recent data items should\nbe proportionally favored. For each algorithm, we prove worst-case bounds on\nrolling coverage quality. We focus on the more practical, application-driven\ncase of maximizing coverage quality given a fixed memory capacity. As a core\nsimplifying assumption, we restrict algorithm design to a single update\noperation: writing from the data stream to a calculated buffer site -- with\ndata never being read back, no metadata stored (e.g., sample timestamps), and\ndata eviction occurring only implicitly via overwrite. Drawing only on\nprimitive, low-level operations and ensuring full, overhead-free use of\navailable memory, this ``DStream'' framework ideally suits domains that are\nresource-constrained, performance-critical, and fine-grained (e.g., individual\ndata items as small as single bits or bytes). The proposed approach supports\n$\\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further\npractical applications, we provide plug-and-play open-source implementations\ntargeting both scripted and compiled application domains.","PeriodicalId":501525,"journal":{"name":"arXiv - CS - Data Structures and Algorithms","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Data Structures and Algorithms","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06199","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Operations over data streams typically hinge on efficient mechanisms to
aggregate or summarize history on a rolling basis. For high-volume data streams,
it is critical to manage state in a manner that is fast and memory efficient --
particularly in resource-constrained or real-time contexts. Here, we address
the problem of extracting a fixed-capacity, rolling subsample from a data
stream. Specifically, we explore ``data stream curation'' strategies to fulfill
requirements on the composition of sample time points retained. Our ``DStream''
suite of algorithms targets three temporal coverage criteria: (1) steady
coverage, where retained samples should spread evenly across elapsed data
stream history; (2) stretched coverage, where early data items should be
proportionally favored; and (3) tilted coverage, where recent data items should
be proportionally favored. For each algorithm, we prove worst-case bounds on
rolling coverage quality. We focus on the more practical, application-driven
case of maximizing coverage quality given a fixed memory capacity. As a core
simplifying assumption, we restrict algorithm design to a single update
operation: writing from the data stream to a calculated buffer site -- with
data never being read back, no metadata stored (e.g., sample timestamps), and
data eviction occurring only implicitly via overwrite. Drawing only on
primitive, low-level operations and ensuring full, overhead-free use of
available memory, this ``DStream'' framework ideally suits domains that are
resource-constrained, performance-critical, and fine-grained (e.g., individual
data items as small as single bits or bytes). The proposed approach supports
$\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further
practical applications, we provide plug-and-play open-source implementations
targeting both scripted and compiled application domains.
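
To make the write-only update model concrete, the sketch below gives a minimal Python analogue: every ingested item is written to a buffer site computed from its time index alone, with no reads, no stored metadata, and eviction happening only by overwrite. The epoch-doubling site function here (steady_site, a hypothetical name) is an illustrative stand-in, not the paper's DStream algorithm: it yields exactly even spacing only at epoch boundaries, is approximate in between, and runs in O(log T) rather than the O(1) bit-level form the paper describes.

    # Illustrative sketch of the write-only "curate by overwrite" model.
    # NOTE: this epoch-doubling scheme is a stand-in for exposition; it is
    # NOT the paper's DStream algorithm (which achieves O(1) site
    # calculation via concise bit-level operations).

    def steady_site(T: int, S: int) -> int:
        """Map time index T to a buffer site in [0, S); S a power of two.

        At the end of each epoch e (times in [S*2**(e-1), S*2**e)),
        exactly the multiples of 2**e in [0, S*2**e) occupy the buffer,
        giving evenly spaced retained samples; coverage between epoch
        boundaries is only approximately even.
        """
        if T < S:
            return T  # epoch 0: fill sites in order
        e = (T // S).bit_length()  # epoch index: S*2**(e-1) <= T < S*2**e
        if T % (1 << e) == 0:
            # "keeper": reuse the site of the retained item at odd grid
            # index (2m - S + 1) * 2**(e-1), halving the prior epoch's grid
            m = T >> e  # m in [S//2, S)
            return steady_site((2 * m - S + 1) << (e - 1), S)
        # non-keeper: park at the site the next keeper will claim, so this
        # item survives only until that keeper (or a later non-keeper) lands
        T_next = ((T >> e) + 1) << e
        return steady_site(T_next, S)

    # The entire update operation: one computed write, never reading back.
    S = 16  # fixed buffer capacity (power of two)
    buffer = [None] * S
    for T, item in enumerate(range(100)):  # stand-in data stream
        buffer[steady_site(T, S)] = item

    # The tilted (recency-favoring) extreme fits the same model as a plain
    # ring buffer: site = T % S retains exactly the S most recent items.

Note how the sketch honors the abstract's core simplifying assumption: state management reduces to a single operation, writing each item to a calculated site, so the full fixed-capacity buffer is used with no per-item bookkeeping overhead.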