Scalable Spatial Analytics and In Situ Query Processing in DaskDB
Suvam Kumar Das, Ronnit Peter, S. Ray
Proceedings of the 18th International Symposium on Spatial and Temporal Data, 2023-08-23
DOI: 10.1145/3609956.3609978 (https://doi.org/10.1145/3609956.3609978)
Vast amounts of data are stored in raw data files. Data scientists and practitioners typically analyze this raw data with data science frameworks, among which the Python Pandas library is one of the most popular. Relational databases (RDBMSs), on the other hand, are still widely used for SQL query execution. Before it can be queried, raw data must be loaded into an RDBMS through an ETL process. Conversely, data stored in an RDBMS may need to be exported or converted into a suitable format before complex data analysis can be performed. This data movement adversely affects the time-to-insight. Recently, a scalable system called DaskDB was introduced, which supports unified data analytics and in situ SQL query processing without requiring any data movement. It supports invoking existing Python APIs as User-Defined Functions (UDFs) within SQL queries, so they can be easily integrated with most existing Python applications. Due to the importance of spatial analytics and spatial SQL queries, we have extended DaskDB to support spatial functionality. In this paper, we present our enhanced DaskDB system. Using two real-world spatial datasets, we demonstrate the scalability of DaskDB’s spatial features.
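To make the idea of calling a Python function as a UDF inside a SQL query concrete, here is a minimal sketch. It does not use DaskDB or reproduce its API, which the abstract does not describe. It relies on Python's standard `sqlite3` module instead, and the CSV contents, the `points` table, and the `haversine_km` and `query_points_within` names are all hypothetical. Unlike an in situ engine, SQLite needs the raw rows loaded first, so the sketch also shows the kind of load step the abstract says DaskDB avoids.

```python
import csv
import io
import math
import sqlite3

# Hypothetical raw data: a tiny CSV of named points (lon/lat in degrees).
RAW_CSV = """name,lon,lat
a,0.0,0.0
b,0.0,1.0
c,1.0,0.0
"""


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def query_points_within(radius_km):
    """Return names of points within radius_km of (0, 0), sorted by name."""
    conn = sqlite3.connect(":memory:")
    # Load step (ETL): SQLite cannot query the raw CSV text in place.
    conn.execute("CREATE TABLE points (name TEXT, lon REAL, lat REAL)")
    rows = [(r["name"], float(r["lon"]), float(r["lat"]))
            for r in csv.DictReader(io.StringIO(RAW_CSV))]
    conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)
    # Register the Python function so SQL can call it as a UDF.
    conn.create_function("haversine_km", 4, haversine_km)
    result = conn.execute(
        "SELECT name FROM points "
        "WHERE haversine_km(lon, lat, 0.0, 0.0) <= ? ORDER BY name",
        (radius_km,),
    ).fetchall()
    conn.close()
    return [name for (name,) in result]
```

Calling `query_points_within(50)` keeps only the point at the origin, since the other two points lie roughly 111 km away. The spatial predicate stays an ordinary Python function and the SQL text simply calls it by name, which is the integration style the abstract attributes to DaskDB's UDF support.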