AlphaSQL: Open Source Software Tool for Automatic Dependency Resolution, Parallelization and Validation for SQL and Data
Masahiro Matsui, Takuto Sugisaki, Kensaku Okada, N. Koshizuka
2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)
Published: 2022-05-01
DOI: https://doi.org/10.1109/icdew55742.2022.00010
Citations: 1
Abstract
Improved performance of database systems has enabled faster SQL querying and more complex data processing. However, as data grows larger and more complex, SQL data processing becomes more difficult and costly. Typical problems include having to update SQL queries and resolve data schemas by hand across complex dependencies. In addition, human error can introduce cyclic dependencies that are hard to untangle. To mitigate these problems, we developed AlphaSQL, an open-source software tool for SQL data processing. AlphaSQL supports three main techniques for automating data preparation with SQL: (1) extracting a directed acyclic graph (DAG) from the dependencies between SQL queries and data, (2) validating the schemas across the whole DAG, and (3) parallelizing the queries based on the DAG. We applied AlphaSQL to a real-world data analysis and machine learning project, where we analyzed 1445 logs obtained from static validation of git commits and 3243 execution logs. Our analysis showed that AlphaSQL detected various errors with high precision and recall, some of which existing tools could not catch (e.g., missing resources and schema mismatches). AlphaSQL would enable more maintainable data management using SQL.
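
To make the DAG-based workflow in the abstract concrete, the following is a minimal Python sketch, not AlphaSQL's actual implementation. It assumes each query's input and output tables are already known (a real tool would derive them by parsing the SQL), and the file names, table names, and the helpers `build_graph` and `parallel_batches` are all hypothetical. It shows how a dependency graph can expose missing resources and cycles, and how the same graph yields groups of queries that could run in parallel.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical queries: file name -> (tables it reads, tables it creates).
# These sets are hand-written for illustration only.
QUERIES = {
    "create_users.sql": (set(), {"users"}),
    "create_orders.sql": (set(), {"orders"}),
    "join_orders.sql": ({"users", "orders"}, {"user_orders"}),
    "report.sql": ({"user_orders"}, {"report"}),
}


def build_graph(queries):
    """Return (graph, missing).

    graph maps each query to the set of queries that create a table it reads.
    missing lists (query, table) pairs where no query creates the table.
    """
    producer = {}
    for name, (_, writes) in queries.items():
        for table in writes:
            producer[table] = name

    graph, missing = {}, []
    for name, (reads, _) in queries.items():
        deps = set()
        for table in sorted(reads):
            if table in producer:
                deps.add(producer[table])
            else:
                missing.append((name, table))
        graph[name] = deps
    return graph, missing


def parallel_batches(graph):
    """Group queries into batches; queries in one batch have no mutual dependencies.

    Raises graphlib.CycleError if the dependencies are cyclic.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # cycle detection happens here
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    graph, missing = build_graph(QUERIES)
    print("missing resources:", missing)
    try:
        for i, batch in enumerate(parallel_batches(graph), start=1):
            print(f"batch {i}: {batch}")
    except CycleError as err:
        print("cyclic dependency:", err.args[1])
```

In this sketch the two `create_*.sql` queries land in the first batch because neither depends on the other, which is the kind of parallelism a dependency DAG makes visible. Schema validation, the second technique named in the abstract, is not covered here because it requires actual SQL analysis.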