Clustering Permutations: New Techniques with Streaming Applications

Diptarka Chakraborty, Debarati Das, Robert Krauthgamer
DOI: 10.48550/arXiv.2212.01821 (https://doi.org/10.48550/arXiv.2212.01821)
Listed journal: Information Technology Convergence and Services
Publication date: 2022-12-04
Citations: 3

Abstract

We study the classical metric $k$-median clustering problem over a set of input rankings (i.e., permutations), which has myriad applications, from social-choice theory to web search and databases. A folklore algorithm provides a $2$-approximate solution in polynomial time for all $k=O(1)$, and works irrespective of the underlying distance measure, so long as it is a metric; however, going below the factor of $2$ is a notorious challenge. We consider the Ulam distance, a variant of the well-known edit-distance metric in which strings are restricted to be permutations. For this metric, Chakraborty, Das, and Krauthgamer [SODA, 2021] provided a $(2-\delta)$-approximation algorithm for $k=1$, where $\delta\approx 2^{-40}$. Our primary contribution is a new algorithmic framework for clustering a set of permutations. Our first result is a $1.999$-approximation algorithm for the metric $k$-median problem under the Ulam metric, which runs in time $(k \log (nd))^{O(k)}n d^3$ for an input consisting of $n$ permutations over $[d]$. In fact, our framework is powerful enough to extend this result to the streaming model (where the $n$ input permutations arrive one by one) using only polylogarithmic (in $n$) space. Additionally, we show that similar results can be obtained even in the presence of outliers, which is presumably a more difficult problem.
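To make the baseline concrete, the following is a minimal Python sketch of the folklore $2$-approximation mentioned above, not of the paper's $1.999$-approximation framework. It assumes the convention that the Ulam distance between two permutations over $[d]$ equals $d$ minus the length of their longest common subsequence (one character move per symbol outside that subsequence), and it restricts candidate centers to the input permutations themselves, which by the triangle inequality loses at most a factor of $2$. The function names `ulam_distance`, `kmedian_cost`, and `folklore_k_median` are illustrative and do not come from the paper.

```python
from bisect import bisect_left
from itertools import combinations


def ulam_distance(x, y):
    """Ulam distance, assumed here to be d - LCS(x, y).

    x and y must be permutations of the same d symbols. Relabeling x by
    each symbol's position in y turns the LCS into a longest increasing
    subsequence, found in O(d log d) time by patience sorting.
    """
    pos = {symbol: i for i, symbol in enumerate(y)}
    relabeled = [pos[symbol] for symbol in x]
    tails = []  # tails[j] = smallest tail of an increasing run of length j + 1
    for v in relabeled:
        j = bisect_left(tails, v)
        if j == len(tails):
            tails.append(v)
        else:
            tails[j] = v
    return len(x) - len(tails)


def kmedian_cost(points, centers, dist):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def folklore_k_median(points, k, dist=ulam_distance):
    """Try every k-subset of the input as centers and keep the cheapest.

    Enumerates on the order of n^k subsets, so it is polynomial only for
    constant k. Works for any metric `dist`, not just the Ulam distance.
    """
    best = min(combinations(points, k),
               key=lambda centers: kmedian_cost(points, centers, dist))
    return list(best), kmedian_cost(points, best, dist)


if __name__ == "__main__":
    rankings = [
        (1, 2, 3, 4),
        (1, 2, 3, 4),
        (2, 3, 4, 1),
        (4, 3, 2, 1),
        (4, 3, 1, 2),
    ]
    centers, cost = folklore_k_median(rankings, k=2)
    print(centers, cost)
```

The exhaustive search over input points is what makes this baseline metric-agnostic: it never constructs a new permutation, so it cannot exploit the structure of the Ulam metric, which is the limitation the abstract's sub-$2$ result addresses.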