Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams

Matthew Andres Moreno, Luis Zaman, Emily Dolson
{"title":"Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams","authors":"Matthew Andres Moreno, Luis Zaman, Emily Dolson","doi":"arxiv-2409.06199","DOIUrl":null,"url":null,"abstract":"Operations over data streams typically hinge on efficient mechanisms to\naggregate or summarize history on a rolling basis. For high-volume data steams,\nit is critical to manage state in a manner that is fast and memory efficient --\nparticularly in resource-constrained or real-time contexts. Here, we address\nthe problem of extracting a fixed-capacity, rolling subsample from a data\nstream. Specifically, we explore ``data stream curation'' strategies to fulfill\nrequirements on the composition of sample time points retained. Our ``DStream''\nsuite of algorithms targets three temporal coverage criteria: (1) steady\ncoverage, where retained samples should spread evenly across elapsed data\nstream history; (2) stretched coverage, where early data items should be\nproportionally favored; and (3) tilted coverage, where recent data items should\nbe proportionally favored. For each algorithm, we prove worst-case bounds on\nrolling coverage quality. We focus on the more practical, application-driven\ncase of maximizing coverage quality given a fixed memory capacity. As a core\nsimplifying assumption, we restrict algorithm design to a single update\noperation: writing from the data stream to a calculated buffer site -- with\ndata never being read back, no metadata stored (e.g., sample timestamps), and\ndata eviction occurring only implicitly via overwrite. Drawing only on\nprimitive, low-level operations and ensuring full, overhead-free use of\navailable memory, this ``DStream'' framework ideally suits domains that are\nresource-constrained, performance-critical, and fine-grained (e.g., individual\ndata items as small as single bits or bytes). The proposed approach supports\n$\\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further\npractical applications, we provide plug-and-play open-source implementations\ntargeting both scripted and compiled application domains.","PeriodicalId":501525,"journal":{"name":"arXiv - CS - Data Structures and Algorithms","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Data Structures and Algorithms","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06199","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Operations over data streams typically hinge on efficient mechanisms to aggregate or summarize history on a rolling basis. For high-volume data streams, it is critical to manage state in a manner that is fast and memory efficient -- particularly in resource-constrained or real-time contexts. Here, we address the problem of extracting a fixed-capacity, rolling subsample from a data stream. Specifically, we explore "data stream curation" strategies to fulfill requirements on the composition of sample time points retained. Our "DStream" suite of algorithms targets three temporal coverage criteria: (1) steady coverage, where retained samples should spread evenly across elapsed data stream history; (2) stretched coverage, where early data items should be proportionally favored; and (3) tilted coverage, where recent data items should be proportionally favored. For each algorithm, we prove worst-case bounds on rolling coverage quality. We focus on the more practical, application-driven case of maximizing coverage quality given a fixed memory capacity. As a core simplifying assumption, we restrict algorithm design to a single update operation: writing from the data stream to a calculated buffer site -- with data never being read back, no metadata stored (e.g., sample timestamps), and data eviction occurring only implicitly via overwrite. Drawing only on primitive, low-level operations and ensuring full, overhead-free use of available memory, this "DStream" framework ideally suits domains that are resource-constrained, performance-critical, and fine-grained (e.g., individual data items as small as single bits or bytes). The proposed approach supports $\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further practical applications, we provide plug-and-play open-source implementations targeting both scripted and compiled application domains.
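To make the update contract concrete, below is a minimal Python sketch. It is not taken from the paper: the names `assign_storage_site` and `ingest`, and the trivial `T % S` site rule, are illustrative assumptions. The sketch shows only the single write-only operation the abstract describes -- each ingested item is placed by a pure function of the current time step, with no reads, no stored metadata, and eviction occurring only by overwrite. The modulo rule degenerates to a sliding window of the most recent items; DStream's steady, stretched, and tilted algorithms would instead substitute bit-level site calculations that realize their respective coverage criteria.

```python
# Minimal sketch of the write-only update contract described in the abstract.
# The site function below is a hypothetical stand-in, NOT the paper's algorithm:
# `T % S` retains a sliding window of the S most recent items.

S = 8                  # fixed buffer capacity, in items
buffer = [None] * S    # preallocated storage; never resized, never read back


def assign_storage_site(S: int, T: int) -> int:
    """Hypothetical site-selection rule: overwrite slots round-robin.

    DStream's steady/stretched/tilted algorithms would instead compute sites
    so that the surviving (non-overwritten) items satisfy the chosen temporal
    coverage criterion.
    """
    return T % S


def ingest(T: int, item) -> None:
    """O(1) ingestion: a single write to a calculated buffer site."""
    buffer[assign_storage_site(S, T)] = item


for T in range(20):    # simulate a stream of 20 items; item value == time step
    ingest(T, item=T)

print(buffer)          # [16, 17, 18, 19, 12, 13, 14, 15] -- the 8 most recent
```

Because the entire retention policy lives in the site-selection function, swapping in a different calculation changes the temporal composition of the retained samples without altering the constant-time ingestion loop.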