{"title":"New Algorithms for Monotone Classification","authors":"Yufei Tao, Yu Wang","doi":"10.1145/3452021.3458324","DOIUrl":"https://doi.org/10.1145/3452021.3458324","url":null,"abstract":"In em monotone classification, the input is a set P of n points in d-dimensional space, where each point carries a label 0 or 1. A point p em dominates another point q if the coordinate of p is at least that of q on every dimension. A em monotone classifier is a function h mapping each d-dimensional point to $0, 1 $, subject to the condition that $h(p) ge h(q)$ holds whenever p dominates q. The classifier h em mis-classifies a point $p in P$ if $h(p)$ is different from the label of p. The em error of h is the number of points in P mis-classified by h. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first em active version, all the labels are hidden in the beginning; an algorithm must pay a unit cost to em probe (i.e., reveal) the label of a point in P. We prove that $Ømega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d=1$). On the other hand, given an arbitrary $eps > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a $1 + eps$ factor, while probing $tO(w/eps^2)$ labels, where w is the dominance width of P and $tO(.)$ hides a polylogarithmic factor. For constant $eps$, the probing cost matches an existing lower bound up to an $tO(1)$ factor. In the second em passive version, the point labels in P are explicitly given; the goal is to minimize CPU computation in finding an optimal classifier. We show that the problem can be settled in time polynomial to both d and n.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129858584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximation Algorithms for Large Scale Data Analysis","authors":"B. Saha","doi":"10.1145/3452021.3458813","DOIUrl":"https://doi.org/10.1145/3452021.3458813","url":null,"abstract":"One of the greatest successes of computational complexity theory is the classification of countless fundamental computational problems into polynomial-time and NP-hard ones, two classes that are often referred to as tractable and intractable, respectively. However, this crude distinction of algorithmic efficiency is clearly insufficient when handling today's large scale of data. We need a finer-grained design and analysis of algorithms that pinpoints the exact exponent of polynomial running time, and a better understanding of when a speed-up is not possible. Based on stronger complexity assumptions than P vs NP, like the Strong Exponential Time Hypothesis, recently conditional lower bounds for a variety of fundamental problems in P have been proposed. Unfortunately, these conditional lower bounds often break down when one may settle for a near-optimal solution. Indeed, approximation algorithms can play a significant role when designing fast algorithms not just for traditional NP Hard problems, but also for polynomial time problems. For some applications arising in machine learning, the time complexity of the underlying algorithms is not sufficient to ensure a fast solution. It is often needed to collect side information about the data to ensure high accuracy. This requires low query complexity. In this presentation, we will cover new facets of fast algorithm design for large scale data analysis that emphasizes on the role of developing approximation algorithms for better polynomial time/query complexity.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126640121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Attribute Skew Free, Isolated CP Theorem, and Massively Parallel Joins","authors":"Miao Qiao, Yufei Tao","doi":"10.1145/3452021.3458321","DOIUrl":"https://doi.org/10.1145/3452021.3458321","url":null,"abstract":"This paper presents an algorithm to process a multi-way join with load $tO(n/p^2/(α φ) )$ under the MPC model, where n is the number of tuples in the input relations, α the maximum arity of those relations, p the number of machines, and φ a newly introduced parameter called the em generalized vertex packing number. The algorithm owes to two new findings. The first is a em two-attribute skew free technique to partition the join result for parallel computation. The second is an em isolated cartesian product theorem, which provides fresh graph-theoretic insights on joins with α ge 3$ and generalizes an existing theorem on α = 2$.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133465356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cover or Pack: New Upper and Lower Bounds for Massively Parallel Joins","authors":"Xiao Hu","doi":"10.1145/3452021.3458319","DOIUrl":"https://doi.org/10.1145/3452021.3458319","url":null,"abstract":"This paper considers the worst-case complexity of multi-round join evaluation in the Massively Parallel Computation (MPC) model. Unlike the sequential RAM model, in which there is a unified optimal algorithm based on the AGM bound for all join queries, worst-case optimal algorithms have been achieved on a very restrictive class of joins in the MPC model. The only known lower bound is still derived from the AGM bound, in terms of the optimal fractional edge covering number of the query. In this work, we make efforts towards bridging this gap. We design an instance-dependent algorithm for the class of α-acyclic join queries. In particular, when the maximum size of input relations is bounded, this complexity has a closed form in terms of the optimal fractional edge covering number of the query, which is worst-case optimal. Beyond acyclic joins, we surprisingly find that the optimal fractional edge covering number does not lead to a tight lower bound. More specifically, we prove for a class of cyclic joins a higher lower bound in terms of the optimal fractional edge packing number of the query, which is matched by existing algorithms, thus optimal. This new result displays a significant distinction for join evaluation, not only between acyclic and cyclic joins, but also between the fine-grained RAM and coarse-grained MPC model.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"66 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114125004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deciding Boundedness of Monadic Sirups","authors":"S. Kikot, Á. Kurucz, V. Podolskii, M. Zakharyaschev","doi":"10.1145/3452021.3458332","DOIUrl":"https://doi.org/10.1145/3452021.3458332","url":null,"abstract":"We show that deciding boundedness (aka FO-rewritability) of monadic single rule datalog programs (sirups) is 2Exp-hard, which matches the upper bound known since 1988 and finally settles a long-standing open problem. We obtain this result as a byproduct of an attempt to classify monadic 'disjunctive sirups'---Boolean conjunctive queries $q$ with unary and binary predicates mediated by a disjunctive rule $T(x) łor F(x) łeftarrow A(x)$---according to the data complexity of their evaluation. Apart from establishing that deciding FO-rewritability of disjunctive sirups with a dag-shaped $q$ is also 2Exp-hard, we make substantial progress towards obtaining a complete FO/Ł-hardness dichotomy of disjunctive sirups with ditree-shaped $q$.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129397897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic Databases under Updates: Boolean Query Evaluation and Ranked Enumeration","authors":"Christoph Berkholz, M. Merz","doi":"10.1145/3452021.3458326","DOIUrl":"https://doi.org/10.1145/3452021.3458326","url":null,"abstract":"We consider tuple-independent probabilistic databases in a dynamic setting, where tuples can be inserted or deleted. In this context we are interested in efficient data structures for maintaining the query result of Boolean as well as non-Boolean queries. For Boolean queries, we show how the known lifted inference rules can be made dynamic, so that they support single-tuple updates with only a constant number of arithmetic operations. As a consequence, we obtain that the probability of every safe UCQ can be maintained with constant update time. For non-Boolean queries, our task is to enumerate all result tuples ranked by their probability. We develop lifted inference rules for non-Boolean queries, and, based on these rules, provide a dynamic data structure that allows both log-time updates and ranked enumeration with logarithmic delay. As an application, we identify a fragment of non-repeating conjunctive queries that supports log-time updates as well as log-delay ranked enumeration. This characterisation is tight under the OMv-conjecture.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131539482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stackless Processing of Streamed Trees","authors":"Corentin Barloy, Filip Murlak, Charles Paperman","doi":"10.1145/3452021.3458320","DOIUrl":"https://doi.org/10.1145/3452021.3458320","url":null,"abstract":"Processing tree-structured data in the streaming model is a challenge: capturing regular properties of streamed trees by means of a stack is costly in memory, but falling back to finite-state automata drastically limits the computational power. We propose an intermediate stackless model based on register automata equipped with a single counter, used to maintain the current depth in the tree. We explore the power of this model to validate and query streamed trees. Our main result is an effective characterization of regular path queries (RPQs) that can be evaluated stacklessly---with and without registers. In particular, we confirm the conjectured characterization of tree languages defined by DTDs that are recognizable without registers, by Segoufin and Vianu (2002), in the special case of tree languages defined by means of an RPQ.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123929039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Datalog Unchained","authors":"V. Vianu","doi":"10.1145/3452021.3458815","DOIUrl":"https://doi.org/10.1145/3452021.3458815","url":null,"abstract":"This is the companion paper of a talk in the Gems of PODS series, that reviews the development, starting at PODS 1988, of a family of Datalog-like languages with procedural, forward chaining semantics, providing an alternative to the classical declarative, model-theoretic semantics. These languages also provide a unified formalism that can express important classes of queries including fixpoint, while, and all computable queries. They can also incorporate in a natural fashion updates and nondeterminism. Datalog variants with forward chaining semantics have been adopted in a variety of settings, including active databases, production systems, distributed data exchange, and data-driven reactive systems.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking Approximate Consistent Query Answering","authors":"M. Calautti, Marco Console, Andreas Pieris","doi":"10.1145/3452021.3458309","DOIUrl":"https://doi.org/10.1145/3452021.3458309","url":null,"abstract":"Consistent query answering (CQA) aims to deliver meaningful answers when queries are evaluated over inconsistent databases. Such answers must be certainly true in all repairs, which are consistent databases whose difference from the inconsistent one is somehow minimal. Although CQA provides a clean framework for querying inconsistent databases, it is arguably more informative to compute the percentage of repairs in which a candidate answer is true, instead of simply saying that is true in all repairs, or is false in at least one repair. It should not be surprising, though, that computing this percentage is computationally hard. On the other hand, for practically relevant settings such as conjunctive queries and primary keys, there are data-efficient randomized approximation schemes for approximating this percentage. Our goal is to perform a thorough experimental evaluation and comparison of those approximation schemes. Our analysis provides new insights on which technique is indicated depending on key characteristics of the input, and it further provides evidence that making approximate CQA as described above feasible in practice is not an unrealistic goal.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127683956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synchronization Schemas","authors":"R. Alur, Phillip Hilliard, Z. Ives, Konstantinos Kallas, Konstantinos Mamouras, Filip Niksic, C. Stanford, V. Tannen, Anton Xue","doi":"10.1145/3452021.3458317","DOIUrl":"https://doi.org/10.1145/3452021.3458317","url":null,"abstract":"We present a type-theoretic framework for data stream processing for real-time decision making, where the desired computation involves a mix of sequential computation, such as smoothing and detection of peaks and surges, and naturally parallel computation, such as relational operations, key-based partitioning, and map-reduce. Our framework unifies sequential (ordered) and relational (unordered) data models. In particular, we define synchronization schemas as types, and series-parallel streams (SPS) as objects of these types. A synchronization schema imposes a hierarchical structure over relational types that succinctly captures ordering and synchronization requirements among different kinds of data items. Series-parallel streams naturally model objects such as relations, sequences, sequences of relations, sets of streams indexed by key values, time-based and event-based windows, and more complex structures obtained by nesting of these. We introduce series-parallel stream transformers (SPST) as a domain-specific language for modular specification of deterministic transformations over such streams. SPSTs provably specify only monotonic transformations allowing streamability, have a modular structure that can be exploited for correct parallel implementation, and are composable allowing specification of complex queries as a pipeline of transformations.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122819801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}