Truly Perfect Samplers for Data Streams and Sliding Windows

Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. Pub Date: 2021-08-26. DOI: 10.1145/3517804.3524139

Rajesh Jayaram, David P. Woodruff, Samson Zhou


Citations: 6


Truly Perfect Samplers for Data Streams and Sliding Windows

In the G-sampling problem, the goal is to output an index $i$ of a vector $f \in \mathbb{R}^n$ such that, for all coordinates $j \in [n]$,

$$\Pr[i = j] = (1 \pm \varepsilon)\,\frac{G(f_j)}{\sum_{k \in [n]} G(f_k)} + \gamma,$$

where $G : \mathbb{R} \to \mathbb{R}_{\ge 0}$ is some non-negative function. If $\varepsilon = 0$ and $\gamma = 1/\mathrm{poly}(n)$, the sampler is called perfect. In the data stream model, $f$ is defined implicitly by a sequence of updates to its coordinates, and the goal is to design such a sampler in small space. Jayaram and Woodruff (FOCS 2018) gave the first perfect $L_p$ samplers in turnstile streams, where $G(x) = |x|^p$, using $\mathrm{polylog}(n)$ space for $p \in (0, 2]$. However, to date all known sampling algorithms are not truly perfect, since their output distribution is only point-wise $\gamma = 1/\mathrm{poly}(n)$ close to the true distribution. This small error can be significant when samplers are run many times on successive portions of a stream, and leak potentially sensitive information about the data stream. In this work, we initiate the study of truly perfect samplers, with $\varepsilon = \gamma = 0$, and comprehensively investigate their complexity in the data stream and sliding window models. We begin by showing that sublinear space truly perfect sampling is impossible in the turnstile model, by proving a lower bound of $\Omega(\min(n, \log(1/\gamma)))$ for any G-sampler with point-wise error $\gamma$ from the true distribution. We then give a general time-efficient sublinear-space framework for developing truly perfect samplers in the insertion-only streaming and sliding window models. As specific applications, our framework addresses $L_p$ sampling for all $p > 0$, e.g., $\tilde{O}(n^{1-1/p})$ space for $p \ge 1$, concave functions, and a large number of measure functions, including the $L_1$-$L_2$, Fair, Huber, and Tukey estimators. The update time of our truly perfect $L_p$-samplers is $O(1)$, which is an exponential improvement over the running time of previous perfect $L_p$-samplers.
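To make the idea of $\varepsilon = \gamma = 0$ concrete, here is a minimal Python sketch of one way to obtain an exact G-sampler on an insertion-only stream. It is an illustration written for this summary and should not be read as the authors' algorithm. A single trial picks a uniformly random stream position by reservoir sampling, counts the occurrences $c$ of that item from the chosen position onward, and accepts with probability $(G(c) - G(c-1))/\zeta$. The function names, the bound `zeta`, and the example stream are all hypothetical, and the sketch assumes $G(0) = 0$, a non-decreasing $G$ on the counts that occur, and that `zeta` upper-bounds every increment $G(c) - G(c-1)$. The helper `one_trial_distribution` additionally assumes an integer-valued $G$ so that it can use exact rational arithmetic.

```python
import random
from collections import Counter
from fractions import Fraction


def sample_once(stream, G, zeta, rng):
    """Run one trial; return a sampled item, or None if the trial rejects.

    Assumes stream items are never None.
    """
    held, count = None, 0
    for t, item in enumerate(stream, start=1):
        if rng.randrange(t) == 0:   # reservoir step: replace with probability 1/t
            held, count = item, 1
        elif item == held:          # count later occurrences of the held item
            count += 1
    if held is None:                # empty stream: nothing to sample
        return None
    # Accept with probability (G(c) - G(c-1)) / zeta. This comparison uses
    # floating point, so the exactness claim is checked separately below.
    accept = (G(count) - G(count - 1)) / zeta
    return held if rng.random() < accept else None


def one_trial_distribution(stream, G, zeta):
    """Exact probability that a single trial outputs each item.

    Uses rational arithmetic, so G must be integer-valued here (an assumption
    of this sketch, not of the sampling idea itself).
    """
    m = len(stream)
    remaining = Counter(stream)     # occurrences at or after the current position
    probs = {}
    for item in stream:
        c = remaining[item]
        step = Fraction(1, m) * Fraction(G(c) - G(c - 1), zeta)
        probs[item] = probs.get(item, Fraction(0)) + step
        remaining[item] -= 1
    return probs


if __name__ == "__main__":
    stream = list("aababcabbb")     # hypothetical stream: a x4, b x5, c x1
    G = lambda x: x * x             # squared frequency, i.e. L2 sampling
    m = len(stream)
    zeta = G(m) - G(m - 1)          # largest increment of x^2 on 1..m
    dist = one_trial_distribution(stream, G, zeta)
    total = sum(dist.values())      # probability that a trial accepts at all
    for item, p in sorted(dist.items()):
        print(item, p / total)      # conditional output distribution
    print(sample_once(stream, G, zeta, random.Random(0)))
```

For an item with frequency $f_i$, the positions holding it see counts $c = f_i, f_i - 1, \dots, 1$, so the increments telescope and a single trial outputs $i$ with probability $G(f_i)/(m\zeta)$, where $m$ is the stream length. Conditioned on acceptance, the output distribution is therefore exactly $G(f_i)/\sum_k G(f_k)$ with no additive error. A trial can reject, so in practice several independent trials would be run in parallel; how many are needed depends on $G$ and is not analysed in this sketch.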
