{"title":"利用结构化下采样实现在线数据流的快速、内存高效整理","authors":"Matthew Andres Moreno, Luis Zaman, Emily Dolson","doi":"arxiv-2409.06199","DOIUrl":null,"url":null,"abstract":"Operations over data streams typically hinge on efficient mechanisms to\naggregate or summarize history on a rolling basis. For high-volume data steams,\nit is critical to manage state in a manner that is fast and memory efficient --\nparticularly in resource-constrained or real-time contexts. Here, we address\nthe problem of extracting a fixed-capacity, rolling subsample from a data\nstream. Specifically, we explore ``data stream curation'' strategies to fulfill\nrequirements on the composition of sample time points retained. Our ``DStream''\nsuite of algorithms targets three temporal coverage criteria: (1) steady\ncoverage, where retained samples should spread evenly across elapsed data\nstream history; (2) stretched coverage, where early data items should be\nproportionally favored; and (3) tilted coverage, where recent data items should\nbe proportionally favored. For each algorithm, we prove worst-case bounds on\nrolling coverage quality. We focus on the more practical, application-driven\ncase of maximizing coverage quality given a fixed memory capacity. As a core\nsimplifying assumption, we restrict algorithm design to a single update\noperation: writing from the data stream to a calculated buffer site -- with\ndata never being read back, no metadata stored (e.g., sample timestamps), and\ndata eviction occurring only implicitly via overwrite. Drawing only on\nprimitive, low-level operations and ensuring full, overhead-free use of\navailable memory, this ``DStream'' framework ideally suits domains that are\nresource-constrained, performance-critical, and fine-grained (e.g., individual\ndata items as small as single bits or bytes). The proposed approach supports\n$\\mathcal{O}(1)$ data ingestion via concise bit-level operations. 
To further\npractical applications, we provide plug-and-play open-source implementations\ntargeting both scripted and compiled application domains.","PeriodicalId":501525,"journal":{"name":"arXiv - CS - Data Structures and Algorithms","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams\",\"authors\":\"Matthew Andres Moreno, Luis Zaman, Emily Dolson\",\"doi\":\"arxiv-2409.06199\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Operations over data streams typically hinge on efficient mechanisms to\\naggregate or summarize history on a rolling basis. For high-volume data steams,\\nit is critical to manage state in a manner that is fast and memory efficient --\\nparticularly in resource-constrained or real-time contexts. Here, we address\\nthe problem of extracting a fixed-capacity, rolling subsample from a data\\nstream. Specifically, we explore ``data stream curation'' strategies to fulfill\\nrequirements on the composition of sample time points retained. Our ``DStream''\\nsuite of algorithms targets three temporal coverage criteria: (1) steady\\ncoverage, where retained samples should spread evenly across elapsed data\\nstream history; (2) stretched coverage, where early data items should be\\nproportionally favored; and (3) tilted coverage, where recent data items should\\nbe proportionally favored. For each algorithm, we prove worst-case bounds on\\nrolling coverage quality. We focus on the more practical, application-driven\\ncase of maximizing coverage quality given a fixed memory capacity. 
As a core\\nsimplifying assumption, we restrict algorithm design to a single update\\noperation: writing from the data stream to a calculated buffer site -- with\\ndata never being read back, no metadata stored (e.g., sample timestamps), and\\ndata eviction occurring only implicitly via overwrite. Drawing only on\\nprimitive, low-level operations and ensuring full, overhead-free use of\\navailable memory, this ``DStream'' framework ideally suits domains that are\\nresource-constrained, performance-critical, and fine-grained (e.g., individual\\ndata items as small as single bits or bytes). The proposed approach supports\\n$\\\\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further\\npractical applications, we provide plug-and-play open-source implementations\\ntargeting both scripted and compiled application domains.\",\"PeriodicalId\":501525,\"journal\":{\"name\":\"arXiv - CS - Data Structures and Algorithms\",\"volume\":\"27 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Data Structures and Algorithms\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.06199\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Data Structures and Algorithms","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06199","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams
Operations over data streams typically hinge on efficient mechanisms to
aggregate or summarize history on a rolling basis. For high-volume data streams,
it is critical to manage state in a manner that is fast and memory efficient --
particularly in resource-constrained or real-time contexts. Here, we address
the problem of extracting a fixed-capacity, rolling subsample from a data
stream. Specifically, we explore "data stream curation" strategies to fulfill
requirements on the composition of sample time points retained. Our "DStream"
suite of algorithms targets three temporal coverage criteria: (1) steady
coverage, where retained samples should spread evenly across elapsed data
stream history; (2) stretched coverage, where early data items should be
proportionally favored; and (3) tilted coverage, where recent data items should
be proportionally favored. For each algorithm, we prove worst-case bounds on
rolling coverage quality. We focus on the more practical, application-driven
case of maximizing coverage quality given a fixed memory capacity. As a core
simplifying assumption, we restrict algorithm design to a single update
operation: writing from the data stream to a calculated buffer site -- with
data never being read back, no metadata stored (e.g., sample timestamps), and
data eviction occurring only implicitly via overwrite. Drawing only on
primitive, low-level operations and ensuring full, overhead-free use of
available memory, this "DStream" framework ideally suits domains that are
resource-constrained, performance-critical, and fine-grained (e.g., individual
data items as small as single bits or bytes). The proposed approach supports
$\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further
practical applications, we provide plug-and-play open-source implementations
targeting both scripted and compiled application domains.
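For intuition, the three coverage criteria can be pictured by the gap structure of an ideal retained sample over elapsed history [0, T]. The toy target sets below are illustrative renderings of "evenly spaced," "early-favored," and "recent-favored," not the paper's exact retention guarantees:

```python
def steady_targets(T: int, S: int) -> list[int]:
    # Steady coverage: S time points spread evenly across [0, T).
    return [round(i * T / S) for i in range(S)]

def stretched_targets(T: int, S: int) -> list[int]:
    # Stretched coverage: geometric spacing, dense near time 0
    # (early data items proportionally favored).
    return [round(T * (2**i - 1) / (2**S - 1)) for i in range(S)]

def tilted_targets(T: int, S: int) -> list[int]:
    # Tilted coverage: mirror image of stretched, dense near time T
    # (recent data items proportionally favored).
    return [T - t for t in reversed(stretched_targets(T, S))]
```

For example, with T = 100 and S = 4, the steady targets are evenly spaced, the stretched targets cluster near the start of history, and the tilted targets cluster near the present.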
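The single-operation update pattern described in the abstract can be sketched as follows. Each ingest computes a buffer site purely from the logical time T, overwrites that site, and never reads data back or stores metadata; eviction happens implicitly via overwrite. The site-selection rule here (bit-reversal of a power-of-two counter) is a hypothetical stand-in for illustration, not one of the published DStream site functions:

```python
S = 8  # fixed buffer capacity; a power of two for this sketch

def bit_reverse(x: int, bits: int) -> int:
    """Reverse the low `bits` bits of x (O(bits), i.e., O(1) for fixed S)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

buffer = [None] * S

def ingest(T: int, item) -> None:
    # One write per data item: site depends only on T, so no metadata
    # (e.g., timestamps) needs to be stored alongside the buffer.
    site = bit_reverse(T % S, S.bit_length() - 1)
    buffer[site] = item
```

Because the site is a pure function of T, a downstream consumer can later reconstruct which time point each buffer slot holds from T alone, which is what lets the buffer run overhead-free at full capacity.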