Clustering with Non-adaptive Subset Queries

arXiv - CS - Data Structures and Algorithms, Pub Date: 2024-09-17, DOI: arxiv-2409.10908

Hadley Black, Euiwoong Lee, Arya Mazumdar, Barna Saha


Citations: 0

Abstract


Recovering the underlying clustering of a set $U$ of $n$ points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query $S \subset U$, $|S|=2$, the oracle returns yes if the points are in the same cluster and no otherwise. For adaptive algorithms with pair-wise queries, the number of required queries is known to be $\Theta(nk)$, where $k$ is the number of clusters. However, non-adaptive schemes require $\Omega(n^2)$ queries, which matches the trivial $O(n^2)$ upper bound attained by querying every pair of points.

To break the quadratic barrier for non-adaptive queries, we study a generalization of this problem to subset queries for $|S|>2$, where the oracle returns the number of clusters intersecting $S$. Allowing for subset queries of unbounded size, $O(n)$ queries are possible with an adaptive scheme (Chakrabarty-Liao, 2024). However, the realm of non-adaptive algorithms is completely unknown.

In this paper, we give the first non-adaptive algorithms for clustering with subset queries. Our main result is a non-adaptive algorithm making $O(n \log k \cdot (\log k + \log\log n)^2)$ queries, which improves to $O(n \log \log n)$ when $k$ is a constant. We also consider algorithms with a restricted query size of at most $s$. In this setting we prove that $\Omega(\max(n^2/s^2,n))$ queries are necessary and obtain algorithms making $\tilde{O}(n^2k/s^2)$ queries for any $s \leq \sqrt{n}$ and $\tilde{O}(n^2/s)$ queries for any $s \leq n$. We also consider the natural special case when the clusters are balanced, obtaining non-adaptive algorithms which make $O(n \log k) + \tilde{O}(k)$ and $O(n\log^2 k)$ queries. Finally, allowing two rounds of adaptivity, we give an algorithm making $O(n \log k)$ queries in the general case and $O(n \log \log k)$ queries when the clusters are balanced.
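To make the query model concrete, the following is a minimal Python sketch of the subset-query oracle described in the abstract, together with the trivial non-adaptive baseline that queries every pair of points. It is not any of the paper's algorithms. The names `make_oracle`, `all_pairs_queries`, and `recover_clusters` and the sample labels are hypothetical, chosen only for illustration, and the union-find reconstruction is one straightforward way to turn pair answers into clusters.

```python
from itertools import combinations


def make_oracle(labels):
    """Build a subset-query oracle for a hidden clustering (illustrative).

    labels[i] is the hidden cluster of point i. The oracle answers a query S
    with the number of distinct clusters that intersect S.
    """
    def oracle(subset):
        return len({labels[i] for i in subset})
    return oracle


def all_pairs_queries(n):
    """Trivial non-adaptive query set: every pair, fixed before any answer."""
    return list(combinations(range(n), 2))


def recover_clusters(n, queries, answers):
    """Rebuild the clustering from pair answers with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), count in zip(queries, answers):
        if count == 1:  # exactly one cluster meets {u, v}: same cluster
            parent[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical hidden clustering of n = 6 points into k = 3 clusters.
hidden = [0, 1, 0, 2, 1, 0]
oracle = make_oracle(hidden)
queries = all_pairs_queries(len(hidden))   # n(n-1)/2 queries, chosen up front
answers = [oracle(q) for q in queries]     # answers are read only afterwards
clusters = recover_clusters(len(hidden), queries, answers)
```

A pair query is the special case $|S|=2$: an answer of 1 means "same cluster" and 2 means "different". The baseline issues $n(n-1)/2$ queries, which is the quadratic cost that the larger subset queries studied in the paper are meant to avoid.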
