Minimum Coresets for Maxima Representation of Multidimensional Data

Yanhao Wang, M. Mathioudakis, Yuchen Li, K. Tan
Published in: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Publication date: 2021-06-20
DOI: 10.1145/3452021.3458322 (https://doi.org/10.1145/3452021.3458322)
Citations: 10

Abstract

Coresets are succinct summaries of large datasets such that, for a given problem, the solution obtained from a coreset is provably competitive with the solution obtained from the full dataset. As such, coreset-based data summarization techniques have been successfully applied to various problems, e.g., geometric optimization, clustering, and approximate query processing, for scaling them up to massive data. In this paper, we study coresets for the maxima representation of multidimensional data: Given a set $P$ of points in $\mathbb{R}^d$, where $d$ is a small constant, and an error parameter $\varepsilon \in (0,1)$, a subset $Q \subseteq P$ is an $\varepsilon$-coreset for the maxima representation of $P$ iff the maximum of $Q$ is an $\varepsilon$-approximation of the maximum of $P$ for any vector $u \in \mathbb{R}^d$, where the maximum is taken over the inner products between the set of points ($P$ or $Q$) and $u$. We define a novel minimum $\varepsilon$-coreset problem that asks for an $\varepsilon$-coreset of the smallest size for the maxima representation of a point set. For the two-dimensional case, we develop an optimal polynomial-time algorithm for the minimum $\varepsilon$-coreset problem by transforming it into the shortest-cycle problem in a directed graph. Then, we prove that this problem is NP-hard in three or higher dimensions and present polynomial-time approximation algorithms in an arbitrary fixed dimension. Finally, we provide extensive experimental results on both real and synthetic datasets to demonstrate the superior performance of our proposed algorithms.
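The coreset condition in the abstract can be illustrated with a small numerical check. The sketch below is not the paper's algorithm. It assumes that "$\varepsilon$-approximation" means $\max_{q \in Q} \langle q, u \rangle \ge (1 - \varepsilon) \max_{p \in P} \langle p, u \rangle$, and that the maximum over $P$ is positive in every direction (for example, the origin lies inside the convex hull of $P$). It also tests only a finite sample of random directions, so passing the check is necessary but not sufficient for $Q$ to be an $\varepsilon$-coreset. All function names are hypothetical.

```python
import numpy as np


def directional_maxima(points, directions):
    """For each direction u (one per row), return the max over points p of <p, u>."""
    return (directions @ points.T).max(axis=1)


def sample_unit_directions(dim, num_directions, seed=0):
    """Draw random unit vectors by normalizing Gaussian samples."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(num_directions, dim))
    return u / np.linalg.norm(u, axis=1, keepdims=True)


def sampled_max_loss(P, Q, num_directions=10_000, seed=0):
    """Largest relative loss 1 - max_Q / max_P over the sampled directions.

    Assumption: max_P is positive in every direction, so the ratio is meaningful.
    """
    U = sample_unit_directions(P.shape[1], num_directions, seed)
    max_p = directional_maxima(P, U)
    max_q = directional_maxima(Q, U)
    return float(np.max(1.0 - max_q / max_p))


def passes_sampled_coreset_check(P, Q, eps, **kwargs):
    """True if no sampled direction violates the assumed eps-coreset condition."""
    return sampled_max_loss(P, Q, **kwargs) <= eps


# Illustrative data: 8 points on the unit circle; Q keeps every second point.
angles = np.arange(8) * np.pi / 4
P = np.column_stack([np.cos(angles), np.sin(angles)])
Q = P[::2]

# The exact worst-case loss for this Q is 1 - cos(45 degrees), roughly 0.293,
# so the check should pass for eps = 0.3 and fail for eps = 0.1.
print(sampled_max_loss(P, Q))
print(passes_sampled_coreset_check(P, Q, eps=0.3))
print(passes_sampled_coreset_check(P, Q, eps=0.1))
```

In this example the worst direction lies halfway between two retained points, which is why dropping every second point of the octagon costs about 29% of the maximum there. The minimum $\varepsilon$-coreset problem asks for the smallest $Q$ that keeps this loss at most $\varepsilon$ in all directions, not just the sampled ones.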