SIGMOD Rec. Pub Date: 2016-09-28 DOI: 10.1145/3003665.3003674
Andrew Pavlo, Matthew Aslett
{"title":"What's Really New with NewSQL?","authors":"Andrew Pavlo, Matthew Aslett","doi":"10.1145/3003665.3003674","DOIUrl":"https://doi.org/10.1145/3003665.3003674","url":null,"abstract":"A new class of database management systems (DBMSs) called NewSQL tout their ability to scale modern on-line transaction processing (OLTP) workloads in a way that is not possible with legacy systems. The term NewSQL was first used by one of the authors of this article in a 2011 business analysis report discussing the rise of new database systems as challengers to these established vendors (Oracle, IBM, Microsoft). The other author was working on what became one of the first examples of a NewSQL DBMS. Since then several companies and research projects have used this term (rightly and wrongly) to describe their systems. Given that relational DBMSs have been around for over four decades, it is justifiable to ask whether the claim of NewSQL's superiority is actually true or whether it is simply marketing. If they are indeed able to get better performance, then the next question is whether there is anything scientifically new about them that enables them to achieve these gains or is it just that hardware has advanced so much that now the bottlenecks from earlier years are no longer a problem. To do this, we first discuss the history of databases to understand how NewSQL systems came about. 
We then provide a detailed explanation of what the term NewSQL means and the different categories of systems that fall under this definition.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"3 4 1","pages":"45-55"},"PeriodicalIF":0.0,"publicationDate":"2016-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78351394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-09-28 DOI: 10.1145/3003665.3003667
Dan Olteanu, Maximilian Schleich
{"title":"Factorized Databases","authors":"Dan Olteanu, Maximilian Schleich","doi":"10.1145/3003665.3003667","DOIUrl":"https://doi.org/10.1145/3003665.3003667","url":null,"abstract":"This paper overviews factorized databases and their application to machine learning. The key observation underlying this work is that state-of-the-art relational query processing entails a high degree of redundancy in the computation and representation of query results. This redundancy can be avoided and is not necessary for subsequent analytics such as learning regression models.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"18 1","pages":"5-16"},"PeriodicalIF":0.0,"publicationDate":"2016-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80841269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-09-28 DOI: 10.1145/3003665.3003672
Yihan Wang, Shaoxu Song, Lei Chen
{"title":"A Survey on Accessing Dataspaces","authors":"Yihan Wang, Shaoxu Song, Lei Chen","doi":"10.1145/3003665.3003672","DOIUrl":"https://doi.org/10.1145/3003665.3003672","url":null,"abstract":"Dataspaces provide a co-existence approach for heterogeneous data. Relationships among these heterogeneous data are often incrementally identified, such as object associations or attribute synonyms. With the different degree of relationships recognized, various query answers may be obtained. In this paper, we review the major techniques for processing and optimizing queries in dataspaces, according to their different abilities of handling relationships, including 1) simple search query without considering relationships, 2) association query over object associations, 3) heterogeneity query with attribute correspondences, and 4) similarity query for similar objects. Techniques such as indexing, query rewriting, expansion, and semantic query optimization are discussed for these query types. Finally, we highlight possible directions in accessing dataspaces.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"73 1","pages":"33-44"},"PeriodicalIF":0.0,"publicationDate":"2016-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79177048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-09-28 DOI: 10.1145/3003665.3003676
M. Winslett, V. Braganholo
{"title":"H V Jagadish Speaks Out on PVLDB, CoRR and Data-driven Research","authors":"M. Winslett, V. Braganholo","doi":"10.1145/3003665.3003676","DOIUrl":"https://doi.org/10.1145/3003665.3003676","url":null,"abstract":"Welcome to ACM SIGMOD Record’s series of interviews with distinguished members of the database community. I’m Marianne Winslett, and today we are in Phoenix, site of the 2012 SIGMOD and PODS conference. I have here with me H. V. Jagadish, who is the Bernard A. Galler Professor of Electrical Engineering and Computer Science at the University of Michigan. Jag has served as the editor-in-chief of the Proceedings of the VLDB, the database area editor for CoRR, and a board member for the Computing Research Association. Jag’s PhD is from Stanford University and he’s an ACM Fellow. So, welcome Jag!","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"28 1","pages":"56-62"},"PeriodicalIF":0.0,"publicationDate":"2016-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73426807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-08-01 DOI: 10.14778/2994509.2994535
Pradap Konda, Sanjib Das, Paul Suganthan G. C., A. Doan, A. Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, J. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, V. Raghavendra
{"title":"Technical Perspective: Toward Building Entity Matching Management Systems","authors":"Pradap Konda, Sanjib Das, Paul Suganthan G. C., A. Doan, A. Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, J. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, V. Raghavendra","doi":"10.14778/2994509.2994535","DOIUrl":"https://doi.org/10.14778/2994509.2994535","url":null,"abstract":"Entity matching (EM) has been a long-standing challenge in data management. Most current EM works focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then describe Magellan, a new kind of EM system. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users execute these steps; the tools seek to cover the entire EM pipeline, not just blocking and matching as current EM systems do. (3) Tools are built into the Python open-source data science ecosystem, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick \"patching\" of the system. 
We describe research challenges and present extensive experiments that show the promise of the Magellan approach.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"118 1","pages":"33-40"},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74896136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-06-02 DOI: 10.1145/2949741.2949752
Jiexing Li, J. Naughton, Rimma V. Nehme
{"title":"Resource Bricolage for Parallel DBMSs on Heterogeneous Clusters","authors":"Jiexing Li, J. Naughton, Rimma V. Nehme","doi":"10.1145/2949741.2949752","DOIUrl":"https://doi.org/10.1145/2949741.2949752","url":null,"abstract":"Running parallel database systems in an environment with heterogeneous resources has become increasingly common, due to cluster evolution and increasing interest in moving applications into public clouds or shared infrastructures. For database systems running in a heterogeneous cluster, the default uniform data partitioning strategy may overload some of the slow machines while at the same time it may underutilize the more powerful machines. Since the processing time of a parallel query is determined by the slowest machine, such an allocation strategy may result in a significant query performance degradation. We take a first step to address this problem by introducing a technique we call resource bricolage that improves database performance in heterogeneous environments. Our approach quantifies the performance differences among machines with various resources as they process workloads with diverse resource requirements. We formalize the problem of minimizing workload execution time and view it as an optimization problem, and then we employ linear programming to obtain a recommended data partitioning scheme. 
We verify the effectiveness of our technique with an extensive experimental study on a commercial database system.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"16 1","pages":"42-49"},"PeriodicalIF":0.0,"publicationDate":"2016-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82774507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-06-02 DOI: 10.1145/2949741.2949753
Z. Ives
{"title":"Technical Perspective: Implicit Parallelism through Deep Language Embedding","authors":"Z. Ives","doi":"10.1145/2949741.2949753","DOIUrl":"https://doi.org/10.1145/2949741.2949753","url":null,"abstract":"Modern “big data” analysis was motivated by the needs of the large Internet players, but it was enabled by two main technical developments: parallel data processing technologies that support reliable and scalable computation over unreliable shared-nothing clusters of computers, and continued advances in machine learning algorithms and techniques. Initial work on these two areas happened largely independently: MapReduce was developed for aggregate computations over large multitudes of records, with minimal control flow and no evident goal of supporting machine learning. Conversely, many of the advances in machine learning research targeted a single machine.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"20 1","pages":"50"},"PeriodicalIF":0.0,"publicationDate":"2016-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84650750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-06-02 DOI: 10.1145/2949741.2949743
J. Naughton
{"title":"Technical Perspective: Natural Language to SQL Translation by Iteratively Exploring a Middle Ground","authors":"J. Naughton","doi":"10.1145/2949741.2949743","DOIUrl":"https://doi.org/10.1145/2949741.2949743","url":null,"abstract":"A fundamental question in data management is how relational database management systems (RDBMSs) should be queried. Ideally, the query interface should be powerful enough to express arbitrary queries, yet simple enough to learn that users require virtually no training. Natural language is an obvious and appealing approach – presumably most users already know at least one natural language and use it to “query” other humans constantly. Unfortunately, employing natural language to query RDBMSs is highly nontrivial, and for the most part, not used. However, with the growing power and ubiquity of Natural Language Processing (NLP) systems, it makes sense to redouble efforts in applying NLP to database querying. At the most basic level, relational database systems are queried using SQL. (For that matter, most “NoSQL” systems are also queried using SQL.) SQL is very powerful and precise, and, for novices, very hard to write. So SQL cannot be used as a user interface for anyone but power users. Nonetheless, as the most widely used RDBMS query language, SQL is the most natural language into which to translate natural language questions over relational data. This translation is the focus of the following paper, “Understanding Natural Language Queries over Relational Databases”, by Li and Jagadish. The first important decision made by the authors of this paper is to reject a one-shot, one-way translation process from a natural language query to a corresponding SQL query. Instead, the authors advocate an iterative dialog between the person posing the query and the system building the relational query. This makes perfect sense – even in the much simpler world of keyword search systems, users iteratively refine their queries. 
Unfortunately, adopting this approach for RDBMS querying does not yield an easy problem – in fact, it uncovers a highly interesting and difficult challenge: how should the user and the system communicate in this iterative process? Answering this question is difficult. Unlike the case for keyword search systems, the answer to the query may not help the user know if the executed query was what they really wanted. For example, consider the simple query “find the difference between sales this year and last year.” In general the RDBMS will return a number – and it is very hard to tell just from that number if the query was correct or not. It would be far more precise for the system to respond to the user by presenting the generated SQL query itself. But this would require the person posing the natural language query to be able to read and understand SQL, which contradicts a major motivation for the system in the first place. Now we come to what is perhaps the heart of this paper: the decision to adopt an intermediate language the authors call “Query Tree,” a two-way domain-independent communication model allowing the user and system to understand one another. A query tree aids mapping a user query to its corresponding semantically correct SQL and ","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"35 1","pages":"5"},"PeriodicalIF":0.0,"publicationDate":"2016-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78079353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-06-02 DOI: 10.1145/2949741.2949751
D. DeWitt
{"title":"Technical Perspective: Taming Hardware Skew as Parallel DBMSs Scale Out","authors":"D. DeWitt","doi":"10.1145/2949741.2949751","DOIUrl":"https://doi.org/10.1145/2949741.2949751","url":null,"abstract":"For almost 40 years now, relational database management systems have successfully used data parallelism to speed up the evaluation of large queries. Here, by “data parallelism” we mean taking one operation (for example, a “join” or an “aggregation”) and spreading it over multiple machines, each operating on a part of the data. In general this approach works spectacularly well, yielding almost linear speedups over a wide variety of workloads. However, like any form of parallelism, data-parallel relational query processing is vulnerable to “skew.” The database literature is full of work dealing with the skew that arises when one node in a parallel system is allocated more work than the average. The following paper, by Li, Naughton, and Nehme, is interesting in that it deals with another kind of skew, one that has received much less attention: “hardware skew,” that is, skew that arises because the processing units in a parallel system are not all of equal power. Such skew can arise in several ways – for example, a parallel system could be constructed “on the fly” by allocating available nodes in a cloud, or a company could upgrade an on-premises system with the addition of new nodes that are of a different generation and class of hardware than the existing ones. If the DBMS is oblivious to the fact that the underlying system is not uniform, the result will be the same as that achieved if the system were constructed entirely of the slowest nodes in the system. If all the nodes in the system are equally “balanced” the solution is simple – if one node is 1/2 as fast as the average, give that node 1/2 the average work, and you are set. Unfortunately, in practice, things are not that simple. 
One node may have a faster CPU but the same I/O performance, or vice-versa; or nodes may have differing amounts of memory or network bandwidth. In such cases simple proportional allocation of work will be suboptimal. The situation is further complicated by the fact that different queries make different demands on the system with respect to CPU, memory, network, and disk; in fact, different stages of a single query can make very different demands. This, finally, is the situation addressed by the paper, “Resource Bricolage for Parallel DBMSs on Heterogeneous Clusters.” The authors make use of techniques for cost estimation growing out of the query optimization and query running time prediction literature; they combine these techniques with a linear programming model that chooses an optimal allocation for a given query on a given system. They demonstrate through an analytic model as well as experiments with an implementation that their proposed solution dominates simpler alternatives. An interesting question this work raises is the duality between “on-demand” load balancing of the type employed by MapReduce-like systems and the predictive, up-front allocation of work advocated by this paper. My suspicion is that both approaches have their place, and the choice of which ","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"94 1","pages":"41"},"PeriodicalIF":0.0,"publicationDate":"2016-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88412266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGMOD Rec. Pub Date: 2016-06-02 DOI: 10.1145/2949741.2949744
Fei Li, H. Jagadish
{"title":"Understanding Natural Language Queries over Relational Databases","authors":"Fei Li, H. Jagadish","doi":"10.1145/2949741.2949744","DOIUrl":"https://doi.org/10.1145/2949741.2949744","url":null,"abstract":"Natural language has been the holy grail of query interface designers, but has generally been considered too hard to work with, except in limited specific circumstances. In this paper, we describe the architecture of an interactive natural language query interface for relational databases. Through a carefully limited interaction with the user, we are able to correctly interpret complex natural language queries, in a generic manner across a range of domains. By these means, a logically complex English language sentence is correctly translated into a SQL query, which may include aggregation, nesting, and various types of joins, among other things, and can be evaluated against an RDBMS. We have constructed a system, NaLIR (Natural Language Interface for Relational databases), embodying these ideas. Our experimental assessment, through user studies, demonstrates that NaLIR is good enough to be usable in practice: even naive users are able to specify quite complex ad-hoc queries.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"92 1","pages":"6-13"},"PeriodicalIF":0.0,"publicationDate":"2016-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76523884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}