Efficient Dynamic Weighted Set Sampling and Its Extension
Authors: Fangyuan Zhang, Mengxu Jiang, Sibo Wang
Journal: Proc. VLDB Endow., pp. 15-27
Published: 2023-09-01
DOI: https://doi.org/10.14778/3617838.3617840
Abstract
Given a weighted set S of n elements, weighted set sampling (WSS) samples an element in S so that each element a_i is sampled with a probability proportional to its weight w(a_i). The classic alias method pre-processes an index in O(n) time with O(n) space and handles WSS with O(1) time. Yet, the alias method does not support dynamic updates. By minor modifications of existing dynamic WSS schemes, it is possible to achieve an expected O(1) update time and draw t independent samples in expected O(t) time with linear space, which is theoretically optimal. But such a method is impractical and even slower than a binary search tree-based solution. How to support both efficient sampling and updates in practice is still challenging. Motivated by this, we design BUS, an efficient scheme that handles an update in O(1) amortized time and draws t independent samples in O(log n + t) time with linear space. A natural extension of WSS is the weighted independent range sampling (WIRS), where each element in S is a data point from ℝ. Given an arbitrary range Q = [ℓ, r] at query time, WIRS aims to do weighted set sampling on the set S_Q of data points falling into range Q. We show that by integrating the theoretically optimal dynamic WSS scheme mentioned above, it can handle an update in O(log n) time and can draw t independent samples for WIRS in O(log n + t) time, the same as the state-of-the-art static algorithm. Again, such a solution by integrating the optimal dynamic WSS scheme is still impractical to handle WIRS queries. We further propose WIRS-BUS to integrate BUS to handle WIRS queries, which handles each update in O(log n) time and draws t independent samples in O(log² n + t) time with linear space. Extensive experiments show that our BUS and WIRS-BUS are efficient for both sampling and updates.
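To make the static baseline in the abstract concrete, the following is a minimal Python sketch of the classic alias method (in the variant usually attributed to Vose). It is not code from the paper. The names `build_alias` and `sample_alias` are hypothetical, and the sketch assumes a non-empty list of positive weights. Building the tables takes O(n) time and each sample takes O(1) time, but changing a single weight requires a full rebuild, which is the limitation the abstract points out.

```python
import random


def build_alias(weights):
    """Build alias tables in O(n) time (assumes non-empty, positive weights)."""
    n = len(weights)
    total = sum(weights)
    # Scale the weights so that the average bucket mass is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n   # probability of keeping bucket i's own element
    alias = [0] * n    # element returned otherwise
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Element l donates (1 - scaled[s]) of its mass to fill bucket s.
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover buckets hold mass 1 up to floating-point rounding.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def sample_alias(prob, alias, rng=random):
    """Draw one index in O(1) time: pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Drawing t independent samples is then t calls to `sample_alias`, for O(t) total time after the O(n) build.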
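The abstract compares against a binary search tree-based dynamic solution. As a rough illustration of that kind of baseline, and not of BUS or WIRS-BUS themselves, the sketch below uses a Fenwick (binary indexed) tree over a fixed number of slots, a swapped-in structure chosen for brevity. The class name `FenwickSampler` and its methods are hypothetical, and the sketch assumes at least one slot and non-negative weights with a positive total. A weight update and a single sample each cost O(log n). The method `sample_range` mimics the WIRS query semantics only over slot indices, assuming the slots are already ordered by their data-point values.

```python
import random


class FenwickSampler:
    """Weighted sampling over slots 0..n-1 with O(log n) update and sample."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = [0.0] * self.n
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.update(i, w)
        # Largest power of two not exceeding n, used by the descent in find().
        self.top = 1
        while self.top * 2 <= self.n:
            self.top *= 2

    def update(self, i, new_weight):
        """Set the weight of slot i; inserting or deleting is setting it to w or 0."""
        delta = new_weight - self.w[i]
        self.w[i] = new_weight
        k = i + 1
        while k <= self.n:
            self.tree[k] += delta
            k += k & -k

    def prefix(self, k):
        """Total weight of slots 0..k-1."""
        s = 0.0
        while k > 0:
            s += self.tree[k]
            k -= k & -k
        return s

    def find(self, u):
        """Smallest slot i whose prefix sum through i exceeds u."""
        pos, step = 0, self.top
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step //= 2
        return min(pos, self.n - 1)  # guard against rounding at the upper end

    def sample(self, rng=random):
        return self.find(rng.random() * self.prefix(self.n))

    def sample_range(self, lo, hi, rng=random):
        """Sample a slot in [lo, hi] with probability proportional to its weight."""
        a, b = self.prefix(lo), self.prefix(hi + 1)
        return min(self.find(a + rng.random() * (b - a)), hi)
```

With this baseline, t samples cost O(t log n), whereas the abstract states O(log n + t) for BUS; the sketch is only meant to show what supporting both updates and sampling looks like in the simplest form.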