Compile-Time Code Generation for Embedded Data-Intensive Query Languages

2018 IEEE International Congress on Big Data (BigData Congress). Pub Date: 2018-07-01. DOI: 10.1109/BigDataCongress.2018.00008

L. Fegaras, Md Hasanuzzaman Noor


Citations: 11

Abstract

Many emerging Big Data programming environments, such as Spark and Flink, provide powerful APIs that are inspired by functional programming. However, because of the complexity involved in developing and fine-tuning data analysis applications using the provided APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, current data analysis query languages, which are typically based on the relational model, cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well integrated with the host programming language, as they are based on an incompatible data model, and are checked for correctness at run time, which results in a significantly longer program development time. To address these shortcomings, we introduce a new query language for data-intensive scalable computing, called DIQL, that is deeply embedded in Scala, and a query optimization framework that optimizes and translates DIQL queries to bytecode at compile time. In contrast to other query languages, our query embedding eliminates impedance mismatch, as any Scala code can be seamlessly mixed with SQL-like syntax without any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer can find any possible join in a query, including joins hidden across deeply nested queries, thus unnesting any form of query nesting. Currently, DIQL can run on three Big Data platforms: Apache Spark, Apache Flink, and Twitter's Cascading/Scalding.
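To make the idea of unnesting concrete, the following is a minimal sketch in plain Python, not DIQL syntax and not the paper's optimizer. It assumes two hypothetical in-memory collections, `customers` and `orders`, and shows a nested query (an inner aggregation evaluated once per outer row) next to an equivalent unnested form that groups once and then joins on the key. All names and data here are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical sample data, not taken from the paper.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]   # (customer id, name)
orders = [(1, 30.0), (1, 12.5), (2, 7.0)]         # (customer id, price)


def nested_totals(customers, orders):
    """Nested form: the inner query rescans `orders` for every customer."""
    return [
        (name, sum((price for (cid, price) in orders if cid == cust_id), 0.0))
        for (cust_id, name) in customers
    ]


def unnested_totals(customers, orders):
    """Unnested form: aggregate `orders` by key once, then outer-join."""
    totals = defaultdict(float)
    for cid, price in orders:
        totals[cid] += price
    # Customers without orders keep a total of 0.0 (left outer join).
    return [(name, totals.get(cust_id, 0.0)) for (cust_id, name) in customers]


if __name__ == "__main__":
    print(nested_totals(customers, orders))
    print(unnested_totals(customers, orders))
```

Both functions return the same result, but the nested form does work proportional to the product of the two collection sizes, while the unnested form makes a single pass over each. On a distributed platform the second shape corresponds to a group-by followed by a join, which is the kind of plan an unnesting optimizer aims to produce.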

