Improved Approximation and Scalability for Fair Max-Min Diversification

Raghavendra Addanki, A. McGregor, A. Meliou, Zafeiria Moumoulidou

International Conference on Database Theory (ICDT) 2022, pages 7:1-7:21. Published January 18, 2022. DOI: 10.4230/LIPIcs.ICDT.2022.7
Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups, and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to each category $i\in [m]$ such that the minimum pairwise distance between the selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance between diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring that enough points of each category are included. We prove the following results:

1. We first consider general metric spaces. We present a randomized polynomial-time algorithm that returns a factor-$2$ approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear-time algorithm returning an $(m+1)$-approximation with exact fairness. The best previous result was a $(3m-1)$-approximation.

2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For a constant number of dimensions and categories, and any constant $\epsilon>0$, we present a $(1+\epsilon)$-approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time, where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk) + \mathrm{poly}(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i\in [m]$.

Finally, we present algorithms suitable for processing massive data sets, including single-pass data stream algorithms and composable coresets for distributed processing.
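To make the objective concrete, here is a minimal, purely illustrative Python sketch that evaluates the fair Max-Min objective by exhaustive search over a small Euclidean point set. The function name, the example data, and the brute-force strategy are assumptions introduced for illustration only; they are not the paper's polynomial-time approximation algorithms, and the Euclidean distance stands in for an arbitrary metric $d$.

```python
from itertools import combinations
from math import dist, inf


def fair_max_min_brute_force(points, labels, quotas):
    """Exhaustive search for the fair Max-Min diversification optimum.

    points: list of coordinate tuples (Euclidean distance is used here
            purely for illustration; the problem is defined for any metric).
    labels: labels[i] is the category of points[i].
    quotas: dict mapping each category i to the number k_i of points required.

    Returns (best_diversity, best_subset_indices). Exponential time --
    intended only to make the objective concrete, not as a practical method.
    """
    n = len(points)
    k = sum(quotas.values())          # total number of points to select
    best_div, best_subset = -inf, None
    for subset in combinations(range(n), k):
        # Fairness constraint: exactly k_i selected points from category i.
        counts = {c: 0 for c in quotas}
        ok = True
        for i in subset:
            c = labels[i]
            if c not in counts:
                ok = False
                break
            counts[c] += 1
        if not ok or counts != quotas:
            continue
        # Diversity = minimum pairwise distance within the selected set.
        diversity = min(dist(points[a], points[b])
                        for a, b in combinations(subset, 2))
        if diversity > best_div:
            best_div, best_subset = diversity, subset
    return best_div, best_subset


# Tiny hypothetical example: two categories, select two points from each.
pts = [(0, 0), (1, 0), (5, 5), (0, 6), (6, 0), (3, 3)]
cats = ["a", "a", "a", "b", "b", "b"]
print(fair_max_min_brute_force(pts, cats, {"a": 2, "b": 2}))
```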