k-Center Clustering with Outliers in the MPC and Streaming Model
M. D. Berg, Leyla Biabani, M. Monemizadeh
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), published 2023-02-24
DOI: 10.1109/IPDPS54959.2023.00090 (https://doi.org/10.1109/IPDPS54959.2023.00090)
Given a point set P ⊆ X of size n in a metric space (X, dist) of doubling dimension d and two parameters k ∈ ℕ and z ∈ ℕ, the k-center problem with z outliers asks to return a set $\mathcal{C}^* = \{c_1^*, \cdots, c_k^*\} \subseteq X$ of k centers such that the maximum distance of all but z points of P to their nearest center in $\mathcal{C}^*$ is minimized. An (ε, k, z)-coreset for this problem is a weighted point set P* such that an optimal solution for the k-center problem with z outliers on P* gives a (1 ± ε)-approximation for the k-center problem with z outliers on P. We study the construction of such coresets in the Massively Parallel Computing (MPC) model, and in the insertion-only as well as the fully dynamic streaming model. We obtain the following results, for any given 0 < ε ⩽ 1. In all cases, the size of the computed coreset is O(k/ε^d + z).

• In the MPC model the data are distributed over m machines. One is the coordinator machine, which will contain the final answer; the others are worker machines. We present a deterministic 2-round algorithm using $O(\sqrt{n})$ machines, where the worker machines have $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z + 1))$ local memory, and the coordinator has $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z + 1) + z)$ local memory. The algorithm can handle point sets P that are distributed arbitrarily (possibly adversarially) over the machines. We also present a randomized algorithm that uses only a single round, under the assumption that the input set P is initially distributed randomly over the machines. Then we present a deterministic algorithm that obtains a trade-off between the number of rounds, R, and the storage per machine.

• In the streaming model we have a single machine with limited storage, and P is revealed in a streaming fashion.
  ○ We present the first lower bound for the insertion-only streaming model, where the points arrive one by one and no points are deleted. We show that any deterministic algorithm that maintains an (ε, k, z)-coreset must use Ω(k/ε^d + z) space. We complement this by a deterministic streaming algorithm using O(k/ε^d + z) space, which is thus optimal.
  ○ For the fully dynamic data streams, where points can be inserted as well as deleted, we give a randomized algorithm for point sets from a d-dimensional discrete Euclidean space [Δ]^d, where Δ ∈ ℕ indicates the size of the universe from which the coordinates are taken. Our algorithm uses only O((k/ε^d + z) log^4(kΔ/(εδ))) space, and it is the first algorithm for this setting. We also present an Ω((k/ε^d) log Δ + z) lower bound for deterministic fully dynamic streaming algorithms.
  ○ For the sliding-window model, we show that any deterministic streaming algorithm that guarantees a (1 + ε)-approximation for the k-center problem with outliers in ℝ^d must use Ω((kz/ε^d) log σ) space, where σ is the ratio of the largest and smallest distance between any two points in the stream. This (negatively) answers a question posed by De Berg, Monemizadeh, and Zhong [1].
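
To make the objective defined in the abstract concrete, the following Python sketch evaluates the cost of a given set of centers under the k-center objective with z outliers, on a weighted point set such as a coreset (a weight counts how many original points a representative stands for). It is only an illustration of the objective, not one of the paper's algorithms: the function name `cost_with_outliers`, the use of Euclidean distance, and the toy data are all assumptions.

```python
import math


def cost_with_outliers(points, weights, centers, z):
    """Smallest radius r such that the total weight of the points lying
    farther than r from their nearest center is at most z.

    Hypothetical helper for illustration; Euclidean distance is assumed.
    """
    # Pair each point's distance to its nearest center with its weight,
    # ordered from the farthest point to the closest one.
    dists = sorted(
        ((min(math.dist(p, c) for c in centers), w)
         for p, w in zip(points, weights)),
        reverse=True,
    )
    excluded = 0
    for d, w in dists:
        # Treat the farthest points as outliers while the budget z allows it.
        if excluded + w > z:
            return d  # this point must be covered, so it fixes the radius
        excluded += w
    return 0.0  # every point fits into the outlier budget


# Toy data (assumed): four unit-weight points on a line, one center at the origin.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (100.0, 0.0)]
centers = [(0.0, 0.0)]

print(cost_with_outliers(points, [1, 1, 1, 1], centers, z=0))  # 100.0
print(cost_with_outliers(points, [1, 1, 1, 1], centers, z=1))  # 10.0
```

With z = 1 the single far point at distance 100 is discarded as an outlier, so the radius drops to 10. If the point at distance 10 instead carried weight 3 (standing for three original points), a budget of z = 2 could not absorb it and the radius would stay at 10.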