Improved Approximation and Scalability for Fair Max-Min Diversification

Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory. Pub Date: 2022-01-18. DOI: 10.4230/LIPIcs.ICDT.2022.7

Raghavendra Addanki, A. McGregor, A. Meliou, Zafeiria Moumoulidou

Pages: 7:1-7:21

Citations: 5

Abstract

Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups, and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$ such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample balances diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results:

1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation.

2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For a constant number of dimensions and categories and any constant $\epsilon>0$, we present a $1+\epsilon$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time, where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+ poly(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i\in [m]$.

Finally, we present algorithms suitable for processing massive data sets, including single-pass data stream algorithms and composable coresets for distributed processing.
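To make the objective concrete, here is a minimal brute-force sketch of the problem statement. It is not one of the paper's algorithms: it enumerates every selection with exactly $k_i$ points from category $i$ and keeps the one with the largest minimum pairwise distance, so it runs in exponential time and only suits tiny inputs. The function names, the use of Euclidean distance via `math.dist`, and the toy data are assumptions made for illustration.

```python
from itertools import combinations, product
from math import dist, inf


def diversity(selection):
    """Minimum pairwise Euclidean distance among the selected points."""
    if len(selection) < 2:
        return inf
    return min(dist(p, q) for p, q in combinations(selection, 2))


def fair_max_min_brute_force(points, groups, quotas):
    """Exhaustive search (illustrative only, exponential time).

    points: list of coordinate tuples
    groups: category label of each point, aligned with `points`
    quotas: dict mapping category label -> number of points to pick (k_i)
    Returns (best diversity, best selection); (-inf, None) if infeasible.
    """
    by_group = {g: [p for p, h in zip(points, groups) if h == g] for g in quotas}
    # All ways to pick exactly quotas[g] points from each category g.
    per_group_choices = [combinations(by_group[g], quotas[g]) for g in quotas]
    best_value, best_selection = -inf, None
    for choice in product(*per_group_choices):
        selection = [p for part in choice for p in part]
        value = diversity(selection)
        if value > best_value:
            best_value, best_selection = value, selection
    return best_value, best_selection


# Toy one-dimensional instance (hypothetical data): two categories "a" and "b".
points = [(0.0,), (1.0,), (4.0,), (5.0,), (9.0,)]
groups = ["a", "a", "b", "b", "a"]
quotas = {"a": 2, "b": 1}
value, selection = fair_max_min_brute_force(points, groups, quotas)
print(value)  # 4.0
```

On this toy instance the best fair selection has diversity 4.0. The unconstrained choice of three points, $\{0, 4, 9\}$ or $\{0, 5, 9\}$, also reaches 4.0 here, but in general the quotas can force a smaller minimum distance than the unconstrained optimum.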



