{"title":"Maximum Coverage in Sublinear Space, Faster","authors":"Stephen Jaud, Anthony Wirth, F. Choudhury","doi":"10.48550/arXiv.2302.06137","DOIUrl":null,"url":null,"abstract":"Given a collection of $m$ sets from a universe $\\mathcal{U}$, the Maximum Set Coverage problem consists of finding $k$ sets whose union has largest cardinality. This problem is NP-Hard, but the solution can be approximated by a polynomial time algorithm up to a factor $1-1/e$. However, this algorithm does not scale well with the input size. In a streaming context, practical high-quality solutions are found, but with space complexity that scales linearly with respect to the size of the universe $|\\mathcal{U}|$. However, one randomized streaming algorithm has been shown to produce a $1-1/e-\\varepsilon$ approximation of the optimal solution with a space complexity that scales only poly-logarithmically with respect to $m$ and $|\\mathcal{U}|$. In order to achieve such a low space complexity, the authors used a technique called subsampling, based on independent-wise hash functions. This article focuses on this sublinear-space algorithm and introduces methods to reduce the time cost of subsampling. We first show how to accelerate by several orders of magnitude without altering the space complexity, number of passes and approximation quality of the original algorithm. Secondly, we derive a new lower bound for the probability of producing a $1-1/e-\\varepsilon$ approximation using only pairwise independence: $1-\\tfrac{4}{c k \\log m}$ compared to the original $1-\\tfrac{2e}{m^{ck/6}}$. Although the theoretical approximation guarantees are weaker, for large streams, our algorithm performs well in practice and present the best time-space-performance trade-off for maximum coverage in streams.","PeriodicalId":9448,"journal":{"name":"Bulletin of the Society of Sea Water Science, Japan","volume":"37 1","pages":"21:1-21:20"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bulletin of the Society of Sea Water Science, Japan","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2302.06137","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Given a collection of $m$ sets from a universe $\mathcal{U}$, the Maximum Set Coverage problem consists of finding $k$ sets whose union has the largest cardinality. This problem is NP-hard, but its solution can be approximated up to a factor of $1-1/e$ by a polynomial-time greedy algorithm. However, this algorithm does not scale well with the input size. In a streaming context, practical high-quality solutions are known, but their space complexity scales linearly with the size of the universe $|\mathcal{U}|$. However, one randomized streaming algorithm has been shown to produce a $1-1/e-\varepsilon$ approximation of the optimal solution with space complexity that scales only poly-logarithmically with $m$ and $|\mathcal{U}|$. To achieve such low space complexity, its authors used a technique called subsampling, based on hash functions of limited independence. This article focuses on this sublinear-space algorithm and introduces methods to reduce the time cost of subsampling. We first show how to accelerate subsampling by several orders of magnitude without altering the space complexity, number of passes, or approximation quality of the original algorithm. Second, we derive a new lower bound on the probability of producing a $1-1/e-\varepsilon$ approximation using only pairwise-independent hash functions: $1-\tfrac{4}{ck\log m}$, compared with the original $1-\tfrac{2e}{m^{ck/6}}$. Although this theoretical approximation guarantee is weaker, for large streams our algorithm performs well in practice and presents the best time-space-performance trade-off for maximum coverage in streams.
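For context, the $1-1/e$ factor cited in the abstract is achieved by the classic greedy algorithm, which repeatedly picks the set covering the most not-yet-covered elements. A minimal Python sketch of that standard algorithm (the function name and toy input are illustrative, not taken from the paper):

```python
def greedy_max_coverage(sets, k):
    """Classic greedy algorithm for Maximum Set Coverage.

    Repeatedly picks the set with the largest marginal coverage;
    guarantees a (1 - 1/e) fraction of the optimal coverage.
    """
    covered = set()
    chosen = []
    remaining = list(range(len(sets)))
    for _ in range(min(k, len(sets))):
        # Select the set covering the most still-uncovered elements.
        best = max(remaining, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no set adds anything new
        chosen.append(best)
        covered |= sets[best]
        remaining.remove(best)
    return chosen, covered

# Toy example: m = 4 sets over the universe {1..6}, k = 2.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
idx, cov = greedy_max_coverage(sets, k=2)
print(idx, sorted(cov))  # picks sets covering all six elements
```

Each iteration scans every remaining set against the current cover, which is why this exact algorithm does not scale well with the input size, as the abstract notes.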
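The subsampling technique can be pictured with the textbook pairwise-independent family $h(x) = (ax+b) \bmod p$: hash each universe element once and keep only the elements whose hash falls below a threshold, so set sizes are estimated over a much smaller subsampled universe. A hedged sketch under that assumption; the paper's actual construction, independence level, and parameters may differ:

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1

def make_pairwise_hash(p=P):
    """Draw h(x) = (a*x + b) mod p from a pairwise-independent family.

    For distinct x != y in [0, p), the pair (h(x), h(y)) is uniform,
    which is exactly what pairwise independence provides.
    """
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: (a * x + b) % p

def subsample(elements, h, rate, p=P):
    """Keep only the elements whose hash falls below rate * p.

    Each element survives with probability ~rate, so the subsampled
    size of a set concentrates around rate times its true size.
    """
    threshold = int(rate * p)
    return {x for x in elements if h(x) < threshold}

h = make_pairwise_hash()
s = set(range(10_000))
print(len(subsample(s, h, rate=0.01)))  # roughly 100 survivors
```

Because the same hash is applied to every set in the stream, relative coverage is approximately preserved on the subsampled universe; the paper's contribution is bounding how often pairwise independence alone suffices for this, at much lower hashing cost than higher-wise independence.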