Semistructured Models, Queries and Algebras in the Big Data Era: Tutorial Summary

Proceedings of the 2016 International Conference on Management of Data. Pub Date: 2016-06-26. DOI: 10.1145/2882903.2912573

Y. Papakonstantinou


Citations: 4

Abstract

Numerous databases promoted as SQL-on-Hadoop, NewSQL and NoSQL support semistructured, schemaless and heterogeneous data, typically in the form of enriched JSON. They also provide corresponding query languages. In addition to these genuine JSON databases, relational databases also provide special functions and language features to support JSON columns, typically piggybacking on the non-1NF (non-first-normal-form) features that SQL has acquired over the years. We refer to SQL databases with JSON support as SQL/JSON databases. The evolving query languages present multiple variations: some are superficial syntactic ones, while others are genuine differences in modeling, language capabilities and semantics. Incompatibility with SQL presents a learning challenge for genuine JSON databases, while the table orientation of SQL/JSON databases often leads to cumbersome syntactic and semantic structures that are contrary to the semistructured nature of JSON. Furthermore, the query languages often fall short of full-fledged semistructured query language capabilities when compared to the yardstick set by XQuery and prior work on semistructured data (even after superficial model differences are abstracted out). We survey the features, the designers' options and the differences in the approaches taken by actual systems. In particular, we first present a SQL backwards-compatible language, named SQL++, which can access both SQL and JSON data. SQL++ is expected to be supported by Couchbase's CouchDB and UCI's AsterixDB semistructured databases. Then we expand SQL++ into the Configurable SQL++, in which multiple possible (and different) semantics are formally captured by the values that the language's semantic configuration options can take.
We show how appropriate settings of the configuration options morph the Configurable SQL++ semantics into the semantics of 10 surveyed languages, hence providing a compact and formal tool for understanding the essential semantic differences between systems. We briefly comment on the utility of formally capturing semantic variations in polystore systems. Finally, we discuss the comparison with prior nested and semistructured query languages (notably OQL and XQuery) and describe a key aspect of query processor implementation: set-oriented semistructured query algebras. In particular, we transfer into the JSON era the lessons of the semistructured query processing research of the 1990s and 2000s and combine them with insights on current JSON databases. Again, the tutorial presents the algebras' fundamentals while abstracting away modeling differences that are not applicable.
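
The idea of a semantic configuration option can be made concrete with a toy sketch. The Python below is not SQL++ and does not reproduce the tutorial's formalism. It only illustrates, under stated assumptions, how one option can change what the same query returns over heterogeneous JSON. The option name `AbsentPolicy`, its three values, and the helpers `navigate` and `select` are all hypothetical names chosen for this illustration.

```python
from enum import Enum


class AbsentPolicy(Enum):
    """Hypothetical option: what a path yields when an attribute is absent."""
    MISSING = "missing"  # omit the attribute from the output tuple
    NULL = "null"        # emit an explicit null (None)
    ERROR = "error"      # reject the query


_ABSENT = object()  # internal sentinel, distinct from a stored null


def navigate(value, path, policy):
    """Follow a dotted path through nested dicts under the given policy."""
    current = value
    for step in path.split("."):
        if isinstance(current, dict) and step in current:
            current = current[step]
        elif policy is AbsentPolicy.ERROR:
            raise KeyError(f"path {path!r} is absent at step {step!r}")
        elif policy is AbsentPolicy.NULL:
            return None
        else:
            return _ABSENT
    return current


def select(rows, paths, policy):
    """Project each row onto the given paths, roughly 'SELECT p1, p2 FROM rows'."""
    output = []
    for row in rows:
        result = {}
        for path in paths:
            found = navigate(row, path, policy)
            if found is not _ABSENT:
                result[path] = found
        output.append(result)
    return output


# Heterogeneous input: the second object has no "gas" attribute.
readings = [
    {"sensor": "a", "gas": {"co": 0.7}},
    {"sensor": "b"},
]

as_missing = select(readings, ["sensor", "gas.co"], AbsentPolicy.MISSING)
as_null = select(readings, ["sensor", "gas.co"], AbsentPolicy.NULL)
```

With `AbsentPolicy.MISSING` the second output tuple simply lacks `gas.co`, with `AbsentPolicy.NULL` it carries an explicit null, and with `AbsentPolicy.ERROR` the same query raises. Surveyed systems can differ on choices of this kind, which is why capturing them as named options gives a compact basis for comparison.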


