Spark SQL: Relational Data Processing in Spark

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Pub Date: 2015-05-27. DOI: 10.1145/2723372.2742797

Michael Armbrust, Reynold Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, M. Franklin, A. Ghodsi, M. Zaharia


Citations: 1296

Abstract

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
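The idea of an optimizer built from composable rules can be illustrated with a toy sketch. This is not Catalyst's API and not Spark code: the node classes (`Literal`, `Attribute`, `Add`), the two rules, and the `transform_up` and `optimize` helpers are hypothetical names chosen for illustration, and the sketch is written in Python rather than Scala. It only shows the general technique of expressing each optimization as a small function over an expression tree and applying a list of such functions until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Union


# Hypothetical expression-tree nodes (illustrative only, not Catalyst classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    """Replace the sum of two literals with a single literal."""
    if isinstance(e, Add) and isinstance(e.left, Literal) and isinstance(e.right, Literal):
        return Literal(e.left.value + e.right.value)
    return None


def drop_zero(e: Expr) -> Optional[Expr]:
    """Simplify x + 0 and 0 + x to x."""
    if isinstance(e, Add):
        if e.left == Literal(0):
            return e.right
        if e.right == Literal(0):
            return e.left
    return None


def transform_up(e: Expr, rules: Sequence[Rule]) -> Expr:
    """Rewrite children first, then try each rule on the rebuilt node."""
    if isinstance(e, Add):
        e = Add(transform_up(e.left, rules), transform_up(e.right, rules))
    for rule in rules:
        rewritten = rule(e)
        if rewritten is not None:
            return rewritten
    return e


def optimize(e: Expr, rules: Sequence[Rule], max_iterations: int = 100) -> Expr:
    """Apply the rule list repeatedly until the tree reaches a fixed point."""
    for _ in range(max_iterations):
        new = transform_up(e, rules)
        if new == e:
            return e
        e = new
    return e


# x + (1 + -1): folding yields x + 0, which the second rule reduces to x.
tree = Add(Attribute("x"), Add(Literal(1), Literal(-1)))
optimized = optimize(tree, [fold_constants, drop_zero])
print(optimized)  # Attribute(name='x')
```

Because each rule is an independent function, a new optimization is added by appending one more function to the list. Neither rule needs to know about the other, yet together they simplify an expression that neither could fully reduce alone.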


