LotusSQL: SQL engine for high-performance big data systems

IF 6.2 | Zone 1, Computer Science | JCR Q1, Computer Science, Artificial Intelligence

Big Data Mining and Analytics Pub Date : 2021-08-26 DOI:10.26599/BDMA.2021.9020009

Xiaohan Li;Bowen Yu;Guanyu Feng;Haojie Wang;Wenguang Chen

Big Data Mining and Analytics, vol. 4, no. 4, pp. 252-265. Full text: https://ieeexplore.ieee.org/document/9523499/


Abstract

In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a Spark module that supports relational analysis with Structured Query Language (SQL) and provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still inherits Spark's inefficiencies, which stem from the Java Virtual Machine and from unnecessary data serialization and deserialization. Adopting a native language such as C++ can help avoid these bottlenecks: benefiting from a bare-metal runtime environment and the use of templates, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the programming and debugging effort required. In this work, we present LotusSQL, an engine that provides SQL support for the dataset abstraction on Lotus, a native backend. We employ a convenient SQL processing framework to handle frontend jobs and add advanced query optimization techniques to improve the quality of execution plans. On top of the storage design and user interface of the compute engine, LotusSQL implements a set of highly efficient structured dataset operations and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× on certain queries and outperforms SparkSQL on a standard query benchmark by more than 2× on average.


Source journal: Big Data Mining and Analytics (Computer Science: Computer Science Applications)

CiteScore: 20.90

Self-citation rate: 2.20%

About the journal: Big Data Mining and Analytics, published by Tsinghua University Press, presents research on big data and its applications. The journal covers the exploration and analysis of large volumes of data from diverse sources to uncover hidden patterns, correlations, insights, and knowledge, featuring the latest developments, research issues, and solutions in data mining and data analytics and their practical applications. It is indexed and abstracted in ESCI, EI, Scopus, DBLP Computer Science, Google Scholar, INSPEC, CSCD, DOAJ, CNKI, and more.