DaskDB: Scalable Data Science with Unified Data Analytics and In Situ Query Processing

2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA). Pub Date: 2021-10-06. DOI: 10.1109/DSAA53316.2021.9564218

A. Watson, Suvam Kumar Das, S. Ray

Citations: 4

Abstract

Data volumes are rising rapidly, creating a need to analyze data efficiently and produce results quickly. Today, however, data scientists must use different systems, because relational databases are primarily used for SQL querying while data science frameworks are used for complex data analysis. This can incur significant and expensive data movement across multiple systems. Furthermore, with relational databases, the data must be completely loaded into the database before any analysis can be performed. We believe that data scientists would prefer a single system for both data analysis tasks and SQL querying, without requiring data movement between systems. Ideally, this system would offer adequate performance, scalability, built-in data analysis functionality, and usability. We present DaskDB, a scalable data science system that supports unified data analytics and in situ SQL query processing on heterogeneous data sources. DaskDB supports invoking Python APIs as User-Defined Functions (UDFs), so it can be easily integrated with most existing Python data science applications. Moreover, we introduce a distributed index join algorithm and a novel distributed learned index to improve join performance. Our experimental evaluation uses the TPC-H benchmark and a custom UDF benchmark that we developed for data analytics. We demonstrate that DaskDB significantly outperforms PySpark and Hive/Hivemall.
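The abstract does not describe how the learned index or the index join work internally, so the following is only a single-node toy sketch of the general idea, not DaskDB's algorithm. It assumes a sorted array of unique numeric keys, fits one linear model from key to position, and uses the model's worst-case error to bound a local binary search. The names `LinearLearnedIndex` and `index_join`, and the TPC-H-style column names, are hypothetical and chosen for illustration.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index (illustrative only, not DaskDB's implementation).

    A least-squares line maps a key to an approximate position in a sorted
    array of unique keys; a bounded binary search corrects the guess.
    """

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The worst prediction error over the indexed keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - p) for p, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        pos = bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return None


def index_join(probe_rows, probe_key, build_rows, build_key):
    """Equi-join that probes a learned index instead of building a hash table.

    Assumes build_rows is sorted by build_key and that build_key is unique.
    """
    index = LinearLearnedIndex([row[build_key] for row in build_rows])
    for row in probe_rows:
        pos = index.lookup(row[probe_key])
        if pos is not None:
            yield {**build_rows[pos], **row}


# Hypothetical TPC-H-style rows: customers sorted by their unique key.
customers = [{"c_custkey": k, "c_name": f"Customer#{k}"} for k in (1, 2, 4, 7, 11, 16)]
orders = [
    {"o_orderkey": 100, "o_custkey": 4},
    {"o_orderkey": 101, "o_custkey": 5},   # no matching customer
    {"o_orderkey": 102, "o_custkey": 16},
]
joined = list(index_join(orders, "o_custkey", customers, "c_custkey"))
```

In this sketch the search window shrinks as the model fits the key distribution better, which is the usual motivation for a learned index over a plain binary search. A distributed variant would additionally need to route each probe key to the partition that holds it, which is outside what this example shows.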
