Improved Approximation and Scalability for Fair Max-Min Diversification

Raghavendra Addanki, A. McGregor, A. Meliou, Zafeiria Moumoulidou
DOI: 10.4230/LIPIcs.ICDT.2022.7 (https://doi.org/10.4230/LIPIcs.ICDT.2022.7)
Venue: International Conference on Database Theory (ICDT)
Published: 2022-01-18
Citations: 5

Abstract

Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups, and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$ such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample balances diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results:

1. We first consider general metric spaces. We present a randomized polynomial-time algorithm that returns a $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear-time algorithm returning an $(m+1)$-approximation with exact fairness. The best previous result was a $(3m-1)$-approximation.

2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimension, a constant number of categories, and any constant $\epsilon>0$, we present a $(1+\epsilon)$-approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time, where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+ \mathrm{poly}(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i\in [m]$.

Finally, we present algorithms suitable for processing massive data sets, including single-pass data stream algorithms and composable coresets for distributed processing.
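To make the objective concrete, the following is a minimal sketch of the problem definition only, not of any algorithm from the paper. It exhaustively tries every selection that takes exactly $k_i$ points from category $i$ and keeps the one with the largest minimum pairwise distance. The function names, the use of Euclidean distance, and the toy input are assumptions made for illustration; the search is exponential and only usable on tiny inputs.

```python
from itertools import combinations, product
from math import dist, inf


def diversity(selected):
    """Minimum pairwise Euclidean distance among the selected points."""
    return min((dist(p, q) for p, q in combinations(selected, 2)), default=inf)


def fair_max_min_brute_force(points, groups, quotas):
    """Exhaustive search over all fair selections (illustration only).

    points : list of coordinate tuples
    groups : groups[i] is the category label of points[i]
    quotas : dict mapping each category label to its required count k_i
    """
    by_group = {g: [p for p, h in zip(points, groups) if h == g] for g in quotas}
    # Every way to pick exactly k_i points from each category i.
    per_group_choices = [combinations(by_group[g], k) for g, k in quotas.items()]
    best_value, best_set = -inf, None
    for choice in product(*per_group_choices):
        selected = [p for part in choice for p in part]
        value = diversity(selected)
        if value > best_value:
            best_value, best_set = value, selected
    return best_value, best_set


# Hypothetical toy instance: five points on a line, two categories.
points = [(0,), (1,), (4,), (5,), (9,)]
groups = ["a", "b", "a", "b", "a"]
value, chosen = fair_max_min_brute_force(points, groups, {"a": 2, "b": 1})
print(value, chosen)  # 4.0 [(0,), (9,), (5,)]
```

An $\alpha$-approximation in the sense of the abstract would return a fair selection whose diversity is at least the optimum found by this search divided by $\alpha$.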