Masahiro Matsui, Takuto Sugisaki, Kensaku Okada, N. Koshizuka. "AlphaSQL: Open Source Software Tool for Automatic Dependency Resolution, Parallelization and Validation for SQL and Data." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00010

Abstract: Improved performance of database systems has enabled faster SQL querying and more complex data processing. However, as data grows larger and more complex, SQL data processing becomes more difficult and costly. Typical problems include manually modifying SQL queries and resolving data schemas within complex dependencies; human error can further introduce hard-to-untangle cyclic dependencies. To mitigate these problems, we developed AlphaSQL, an open-source software tool for SQL data processing. AlphaSQL supports three main techniques that automate data preparation with SQL: (1) extracting a directed acyclic graph (DAG) from the dependencies between SQL queries and data, (2) validating the schemas across the whole DAG, and (3) parallelizing the queries based on the DAG. We applied AlphaSQL to a real-world data analysis and machine learning project, analyzing 1445 logs obtained from static validation of git commits and 3243 execution logs. Our analysis showed that AlphaSQL detected various errors with high precision and recall, some of which existing tools could not catch (e.g., missing resources and schema mismatches). AlphaSQL thus enables more maintainable data management with SQL.
{"title":"Learned Index on GPU","authors":"Xun Zhong, Yong Zhang, Yu Chen, Chao Li, Chunxiao Xing","doi":"10.1109/icdew55742.2022.00024","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00024","url":null,"abstract":"Index is a key structure created to quickly access specific information in database. Recent research on “learned indexes” has received extensive attention. The key idea is that index can be regarded as a model that maps keys to specific locations in data sets, so the traditional index structure can be replaced by machine learning models. Current learned indexes universally gain higher time efficiency and occupy smaller space than traditional indexes, but their query efficiency and concurrency are limited by CPU. GPU is widely used in computing intensive tasks because of its unique architecture and powerful computing ability. According to the research on learned index in recent years, we propose a new trait of thought to combine the advantages of GPU and learned index, which puts learned index in GPU memory and makes full use of the high concurrency and computing power of GPU. We implement the PGM-index on GPU and conduct an extensive set of experiments on several real-life and synthetic datasets. The results demonstrate that our method beats the original learned index on CPU by up to 20× for static workloads when query scale is large.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122715081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Max Lübbering, Maren Pielka, Ilhamcengiz Henk, R. Sifa. "Datastack: Unification of Heterogeneous Machine Learning Dataset Interfaces." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00014

Abstract: Machine learning (ML) dataset preprocessing, cleaning, and integration into ML pipelines is often a cumbersome endeavor that is susceptible to bugs and leads to unstructured code from the start. While existing frameworks for dataset integration often come with an extensive dataset repository, extending these repositories to new datasets is nontrivial due to the lack of separation between dataset retrieval, processing, and iteration. To simplify dataset integration, we present Datastack, an open-source framework that minimizes these efforts by providing well-defined interfaces that integrate seamlessly into existing machine learning frameworks. Inspired by stream processing frameworks such as Flink and Storm, Datastack decouples dataset-specific peculiarities, such as custom data formats, from the framework by introducing byte streams at the interface level. Furthermore, Datastack provides dataset preprocessing functionality such as stacking, splitting, and merging to reduce the need for error-prone custom data processing pipelines.
{"title":"Sample-based Kernel Structure Learning with Deep Neural Networks for Automated Structure Discovery","authors":"Alexander Grass, Till Döhmen, C. Beecks","doi":"10.1109/icdew55742.2022.00017","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00017","url":null,"abstract":"Time series are prominent in a broad variety of application domains. Given a time series, how to automatically derive its inherent structure? While Gaussian process models can describe structure characteristics by their individual exploitation of covariance functions, their inference is still a computationally complex task. State-of-the-art methods therefore aim to efficiently infer an interpretable model by searching appropriate kernel compositions associated with a high-dimensional hyperparameter space. In this work, we propose a new alternative approach to learn structural components of a time series directly without inference. To this end we train a deep neural network based on kernel-induced samples, in order to obtain a generalized model for the estimation of kernel compositions. Our investigations show that our proposed approach is able to effectively classify kernel compositions of random time series data as well as estimate their hyperparameters efficiently and with high accuracy.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115321852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Georg Stefan Schlake, J. D. Hüwel, Fabian Berns, C. Beecks. "Evaluating the Lottery Ticket Hypothesis to Sparsify Neural Networks for Time Series Classification." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00015

Abstract: Reducing the complexity of deep learning models is a challenging task in many machine learning pipelines. In particular for increasingly complex data spaces, the question of how to mitigate storage efforts for large machine learning models becomes of crucial importance. The recently proposed Lottery Ticket Hypothesis is one promising approach to decreasing the size of a neural network without losing its expressiveness. While the Lottery Ticket Hypothesis has been shown to outperform other pruning methods in the field of image classification, it has not yet been extensively investigated in the domain of time series. In this paper, we thus investigate this hypothesis for the task of time series classification and empirically show that different deep learning architectures can be compressed by large factors without sacrificing expressiveness.
Theodoros Toliopoulos, A. Michailidou, A. Gounaris. "Data placement in dynamic fog ecosystems." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00009

Abstract: Dynamic data placement in distributed fog databases requires different mechanisms from those typically found in modern cloud-hosted storage solutions, in order to account for node instability and for the latency of collecting results from queries that run across multiple sites. In this work, we examine two dynamic data placement policies. The first is based on fog node stability metadata, while the second is driven by the analytic applications running on top of the distributed storage, taking into account latency, data freshness, and quality objectives. Both policies are enabled through extensions to Apache Ignite.
Andra Ionescu, Rihan Hai, Marios Fragkoulis, Asterios Katsifodimos. "Join Path-Based Data Augmentation for Decision Trees." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00018

Abstract: Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the richness of dataset relationships. With relational data, the challenge lies in identifying the join paths that best augment a feature table to increase the performance of a model. In this paper, we propose a two-step, automated data augmentation approach for relational data that involves: (i) enumerating join paths of various lengths given a base table, and (ii) ranking the join paths using filter methods for feature selection. We show that our approach can improve prediction accuracy and reduce runtime compared to the baseline approach.
{"title":"Anatomy of Learned Database Tuning with Bayesian Optimization","authors":"George-Octavian Barbulescu, P. Triantafillou","doi":"10.1109/icdew55742.2022.00006","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00006","url":null,"abstract":"Database Management System (DBMS) tuning is central to the performance of the end-to-end database system. DBMSs are typically characterised by hundreds of configuration knobs that impact various facets of their behavior and planning abilities. Tuning such a system is a prohibitively-challenging task due to the obfuscated knob inter-dependencies and the intimidating size of the design space. The general vendor recommendation is to sequentially tune each knob, which further exacerbates the time-consuming nature of the task. To overcome this, recent work in the realm of self-driving database systems proxy the design problem through Machine Learning. Among the most prominent proxies in self-managing databases literature is the Bayesian-inference proxy. The purpose of this proxy, or surrogate in Bayesian Optimisation parlance, is to learn the inter-knob relationships and how they relate to the overall performance, independent of any human guidance. To this end, one of the goals of this work is to shed light on the common design patterns we identify in Bayesian-driven DBMS tuning agents. Second of all, we aim to provide a handbook for implementing such agents through the lens of a new tuning framework that leverages a multi-regression proxy.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"1225 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132028855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring System and Machine Learning Performance Interactions when Tuning Distributed Data Stream Applications","authors":"Lambros Odysseos, H. Herodotou","doi":"10.1109/icdew55742.2022.00008","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00008","url":null,"abstract":"Deploying machine learning (ML) applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs have a vast array of system configuration parameters (such as degree of parallelism, memory buffer sizes, etc.) that need to be optimized to achieve the desired levels of latency and/or throughput. Second, each ML model has its own set of hyper-parameters that need to be tuned as they significantly impact the overall prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature but only in isolation from each other. This position paper identifies the necessity for a combined system and ML model tuning approach based on a thorough experimental study. In particular, experimental results have revealed unexpected and complex interactions between the choices of system configuration and hyper-parameters, and their impact on both application and model performance. These findings open up new research directions in the field of self-managing stream processing systems.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125031144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}