Masahiro Matsui, Takuto Sugisaki, Kensaku Okada, N. Koshizuka
2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: https://doi.org/10.1109/icdew55742.2022.00010
AlphaSQL: Open Source Software Tool for Automatic Dependency Resolution, Parallelization and Validation for SQL and Data
Improvements in database system performance have enabled faster SQL querying and more complex data processing. However, as data grows larger and more complex, SQL data processing becomes more difficult and costly. Typical problems include having to modify SQL queries and resolve data schemas by hand across complex dependencies. In addition, human error can introduce cyclic dependencies that are hard to untangle. To mitigate these problems, we developed AlphaSQL, an open-source software tool for SQL data processing. AlphaSQL supports three main techniques for automating data preparation with SQL: (1) extracting a directed acyclic graph (DAG) from the dependencies between SQL queries and data, (2) validating the schemas across the whole DAG, and (3) parallelizing the queries based on the DAG. We applied AlphaSQL to a real-world data analysis and machine learning project, in which we analyzed 1445 logs obtained from static validation of git commits and 3243 execution logs. Our analysis showed that AlphaSQL detected various errors with high precision and recall, some of which existing tools could not catch (e.g., missing resources and schema mismatches). AlphaSQL could enable more maintainable data management using SQL.
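To make techniques (1) and (3) concrete, the following is a minimal Python sketch of the general idea. It is not AlphaSQL's implementation. It assumes that the tables each query creates and reads are already known (a real tool would obtain them by parsing the SQL), and every file name, table name, and helper function here is hypothetical. It builds the dependency graph, reports a read from a table that nothing produces, rejects cyclic dependencies, and groups the queries into stages that could run in parallel.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical input: for each SQL file, the tables it creates and reads.
# A real tool would derive these sets by parsing the SQL itself.
QUERIES = {
    "create_users.sql": {"creates": {"users"}, "reads": {"raw_events"}},
    "create_orders.sql": {"creates": {"orders"}, "reads": {"raw_events"}},
    "create_report.sql": {"creates": {"report"}, "reads": {"users", "orders"}},
}
# Tables assumed to exist before any query runs (also hypothetical).
EXTERNAL_TABLES = {"raw_events"}


def build_dag(queries, external_tables):
    """Map each query to the set of queries that must run before it."""
    # Record which query produces each table.
    producer = {}
    for name, info in queries.items():
        for table in info["creates"]:
            producer[table] = name

    dag = {}
    for name, info in queries.items():
        deps = set()
        for table in info["reads"]:
            if table in producer:
                deps.add(producer[table])
            elif table not in external_tables:
                # Nothing creates this table and it is not a known input.
                raise LookupError(f"{name} reads missing table {table!r}")
        dag[name] = deps
    return dag


def parallel_stages(dag):
    """Group queries into stages; queries in one stage do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the dependencies are cyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    print(parallel_stages(build_dag(QUERIES, EXTERNAL_TABLES)))
    # prints [['create_orders.sql', 'create_users.sql'], ['create_report.sql']]
```

In this sketch the two queries in the first stage read only the external table, so they could be submitted concurrently, while the report query waits for both. Schema validation across the DAG, technique (2), would additionally need column and type information for every table, which is omitted here.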