{"title":"New Algorithms for Monotone Classification","authors":"Yufei Tao, Yu Wang","doi":"10.1145/3452021.3458324","DOIUrl":"https://doi.org/10.1145/3452021.3458324","url":null,"abstract":"In em monotone classification, the input is a set P of n points in d-dimensional space, where each point carries a label 0 or 1. A point p em dominates another point q if the coordinate of p is at least that of q on every dimension. A em monotone classifier is a function h mapping each d-dimensional point to $0, 1 $, subject to the condition that $h(p) ge h(q)$ holds whenever p dominates q. The classifier h em mis-classifies a point $p in P$ if $h(p)$ is different from the label of p. The em error of h is the number of points in P mis-classified by h. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first em active version, all the labels are hidden in the beginning; an algorithm must pay a unit cost to em probe (i.e., reveal) the label of a point in P. We prove that $Ømega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d=1$). On the other hand, given an arbitrary $eps > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a $1 + eps$ factor, while probing $tO(w/eps^2)$ labels, where w is the dominance width of P and $tO(.)$ hides a polylogarithmic factor. For constant $eps$, the probing cost matches an existing lower bound up to an $tO(1)$ factor. In the second em passive version, the point labels in P are explicitly given; the goal is to minimize CPU computation in finding an optimal classifier. We show that the problem can be settled in time polynomial to both d and n.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129858584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximation Algorithms for Large Scale Data Analysis","authors":"B. Saha","doi":"10.1145/3452021.3458813","DOIUrl":"https://doi.org/10.1145/3452021.3458813","url":null,"abstract":"One of the greatest successes of computational complexity theory is the classification of countless fundamental computational problems into polynomial-time and NP-hard ones, two classes that are often referred to as tractable and intractable, respectively. However, this crude distinction of algorithmic efficiency is clearly insufficient when handling today's large scale of data. We need a finer-grained design and analysis of algorithms that pinpoints the exact exponent of polynomial running time, and a better understanding of when a speed-up is not possible. Based on stronger complexity assumptions than P vs NP, like the Strong Exponential Time Hypothesis, recently conditional lower bounds for a variety of fundamental problems in P have been proposed. Unfortunately, these conditional lower bounds often break down when one may settle for a near-optimal solution. Indeed, approximation algorithms can play a significant role when designing fast algorithms not just for traditional NP Hard problems, but also for polynomial time problems. For some applications arising in machine learning, the time complexity of the underlying algorithms is not sufficient to ensure a fast solution. It is often needed to collect side information about the data to ensure high accuracy. This requires low query complexity. In this presentation, we will cover new facets of fast algorithm design for large scale data analysis that emphasizes on the role of developing approximation algorithms for better polynomial time/query complexity.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126640121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Attribute Skew Free, Isolated CP Theorem, and Massively Parallel Joins","authors":"Miao Qiao, Yufei Tao","doi":"10.1145/3452021.3458321","DOIUrl":"https://doi.org/10.1145/3452021.3458321","url":null,"abstract":"This paper presents an algorithm to process a multi-way join with load $tO(n/p^2/(α φ) )$ under the MPC model, where n is the number of tuples in the input relations, α the maximum arity of those relations, p the number of machines, and φ a newly introduced parameter called the em generalized vertex packing number. The algorithm owes to two new findings. The first is a em two-attribute skew free technique to partition the join result for parallel computation. The second is an em isolated cartesian product theorem, which provides fresh graph-theoretic insights on joins with α ge 3$ and generalizes an existing theorem on α = 2$.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133465356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cover or Pack: New Upper and Lower Bounds for Massively Parallel Joins","authors":"Xiao Hu","doi":"10.1145/3452021.3458319","DOIUrl":"https://doi.org/10.1145/3452021.3458319","url":null,"abstract":"This paper considers the worst-case complexity of multi-round join evaluation in the Massively Parallel Computation (MPC) model. Unlike the sequential RAM model, in which there is a unified optimal algorithm based on the AGM bound for all join queries, worst-case optimal algorithms have been achieved on a very restrictive class of joins in the MPC model. The only known lower bound is still derived from the AGM bound, in terms of the optimal fractional edge covering number of the query. In this work, we make efforts towards bridging this gap. We design an instance-dependent algorithm for the class of α-acyclic join queries. In particular, when the maximum size of input relations is bounded, this complexity has a closed form in terms of the optimal fractional edge covering number of the query, which is worst-case optimal. Beyond acyclic joins, we surprisingly find that the optimal fractional edge covering number does not lead to a tight lower bound. More specifically, we prove for a class of cyclic joins a higher lower bound in terms of the optimal fractional edge packing number of the query, which is matched by existing algorithms, thus optimal. This new result displays a significant distinction for join evaluation, not only between acyclic and cyclic joins, but also between the fine-grained RAM and coarse-grained MPC model.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"66 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114125004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deciding Boundedness of Monadic Sirups","authors":"S. Kikot, Á. Kurucz, V. Podolskii, M. Zakharyaschev","doi":"10.1145/3452021.3458332","DOIUrl":"https://doi.org/10.1145/3452021.3458332","url":null,"abstract":"We show that deciding boundedness (aka FO-rewritability) of monadic single rule datalog programs (sirups) is 2Exp-hard, which matches the upper bound known since 1988 and finally settles a long-standing open problem. We obtain this result as a byproduct of an attempt to classify monadic 'disjunctive sirups'---Boolean conjunctive queries $q$ with unary and binary predicates mediated by a disjunctive rule $T(x) łor F(x) łeftarrow A(x)$---according to the data complexity of their evaluation. Apart from establishing that deciding FO-rewritability of disjunctive sirups with a dag-shaped $q$ is also 2Exp-hard, we make substantial progress towards obtaining a complete FO/Ł-hardness dichotomy of disjunctive sirups with ditree-shaped $q$.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129397897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic Databases under Updates: Boolean Query Evaluation and Ranked Enumeration","authors":"Christoph Berkholz, M. Merz","doi":"10.1145/3452021.3458326","DOIUrl":"https://doi.org/10.1145/3452021.3458326","url":null,"abstract":"We consider tuple-independent probabilistic databases in a dynamic setting, where tuples can be inserted or deleted. In this context we are interested in efficient data structures for maintaining the query result of Boolean as well as non-Boolean queries. For Boolean queries, we show how the known lifted inference rules can be made dynamic, so that they support single-tuple updates with only a constant number of arithmetic operations. As a consequence, we obtain that the probability of every safe UCQ can be maintained with constant update time. For non-Boolean queries, our task is to enumerate all result tuples ranked by their probability. We develop lifted inference rules for non-Boolean queries, and, based on these rules, provide a dynamic data structure that allows both log-time updates and ranked enumeration with logarithmic delay. As an application, we identify a fragment of non-repeating conjunctive queries that supports log-time updates as well as log-delay ranked enumeration. This characterisation is tight under the OMv-conjecture.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131539482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stackless Processing of Streamed Trees","authors":"Corentin Barloy, Filip Murlak, Charles Paperman","doi":"10.1145/3452021.3458320","DOIUrl":"https://doi.org/10.1145/3452021.3458320","url":null,"abstract":"Processing tree-structured data in the streaming model is a challenge: capturing regular properties of streamed trees by means of a stack is costly in memory, but falling back to finite-state automata drastically limits the computational power. We propose an intermediate stackless model based on register automata equipped with a single counter, used to maintain the current depth in the tree. We explore the power of this model to validate and query streamed trees. Our main result is an effective characterization of regular path queries (RPQs) that can be evaluated stacklessly---with and without registers. In particular, we confirm the conjectured characterization of tree languages defined by DTDs that are recognizable without registers, by Segoufin and Vianu (2002), in the special case of tree languages defined by means of an RPQ.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123929039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Datalog Unchained","authors":"V. Vianu","doi":"10.1145/3452021.3458815","DOIUrl":"https://doi.org/10.1145/3452021.3458815","url":null,"abstract":"This is the companion paper of a talk in the Gems of PODS series, that reviews the development, starting at PODS 1988, of a family of Datalog-like languages with procedural, forward chaining semantics, providing an alternative to the classical declarative, model-theoretic semantics. These languages also provide a unified formalism that can express important classes of queries including fixpoint, while, and all computable queries. They can also incorporate in a natural fashion updates and nondeterminism. Datalog variants with forward chaining semantics have been adopted in a variety of settings, including active databases, production systems, distributed data exchange, and data-driven reactive systems.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking Approximate Consistent Query Answering","authors":"M. Calautti, Marco Console, Andreas Pieris","doi":"10.1145/3452021.3458309","DOIUrl":"https://doi.org/10.1145/3452021.3458309","url":null,"abstract":"Consistent query answering (CQA) aims to deliver meaningful answers when queries are evaluated over inconsistent databases. Such answers must be certainly true in all repairs, which are consistent databases whose difference from the inconsistent one is somehow minimal. Although CQA provides a clean framework for querying inconsistent databases, it is arguably more informative to compute the percentage of repairs in which a candidate answer is true, instead of simply saying that is true in all repairs, or is false in at least one repair. It should not be surprising, though, that computing this percentage is computationally hard. On the other hand, for practically relevant settings such as conjunctive queries and primary keys, there are data-efficient randomized approximation schemes for approximating this percentage. Our goal is to perform a thorough experimental evaluation and comparison of those approximation schemes. Our analysis provides new insights on which technique is indicated depending on key characteristics of the input, and it further provides evidence that making approximate CQA as described above feasible in practice is not an unrealistic goal.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127683956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synchronization Schemas","authors":"R. Alur, Phillip Hilliard, Z. Ives, Konstantinos Kallas, Konstantinos Mamouras, Filip Niksic, C. Stanford, V. Tannen, Anton Xue","doi":"10.1145/3452021.3458317","DOIUrl":"https://doi.org/10.1145/3452021.3458317","url":null,"abstract":"We present a type-theoretic framework for data stream processing for real-time decision making, where the desired computation involves a mix of sequential computation, such as smoothing and detection of peaks and surges, and naturally parallel computation, such as relational operations, key-based partitioning, and map-reduce. Our framework unifies sequential (ordered) and relational (unordered) data models. In particular, we define synchronization schemas as types, and series-parallel streams (SPS) as objects of these types. A synchronization schema imposes a hierarchical structure over relational types that succinctly captures ordering and synchronization requirements among different kinds of data items. Series-parallel streams naturally model objects such as relations, sequences, sequences of relations, sets of streams indexed by key values, time-based and event-based windows, and more complex structures obtained by nesting of these. We introduce series-parallel stream transformers (SPST) as a domain-specific language for modular specification of deterministic transformations over such streams. SPSTs provably specify only monotonic transformations allowing streamability, have a modular structure that can be exploited for correct parallel implementation, and are composable allowing specification of complex queries as a pipeline of transformations.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122819801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}