Masahiro Matsui, Takuto Sugisaki, Kensaku Okada, N. Koshizuka. "AlphaSQL: Open Source Software Tool for Automatic Dependency Resolution, Parallelization and Validation for SQL and Data." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00010

Abstract: Improved performance of database systems has enabled faster SQL querying and more complex data processing. However, as data grows larger and more complex, SQL data processing becomes more difficult and costly. Typical problems include manually modifying SQL queries and resolving data schemas within complex dependencies; human error can further introduce hard-to-untangle cyclic dependencies. To mitigate these problems, we developed AlphaSQL, an open-source software tool for SQL data processing. AlphaSQL supports three main techniques that automate data preparation with SQL: (1) extracting a directed acyclic graph (DAG) from the dependencies between SQL queries and data, (2) validating the schemas across the whole DAG, and (3) parallelizing the queries based on the DAG. We applied AlphaSQL to a real-world data analysis and machine learning project, analyzing 1445 logs obtained from static validation of git commits and 3243 execution logs. Our analysis showed that AlphaSQL detected various errors with high precision and recall, some of which existing tools could not catch (e.g., missing resources and schema mismatches). AlphaSQL thus enables more maintainable data management with SQL.
{"title":"Learned Index on GPU","authors":"Xun Zhong, Yong Zhang, Yu Chen, Chao Li, Chunxiao Xing","doi":"10.1109/icdew55742.2022.00024","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00024","url":null,"abstract":"Index is a key structure created to quickly access specific information in database. Recent research on “learned indexes” has received extensive attention. The key idea is that index can be regarded as a model that maps keys to specific locations in data sets, so the traditional index structure can be replaced by machine learning models. Current learned indexes universally gain higher time efficiency and occupy smaller space than traditional indexes, but their query efficiency and concurrency are limited by CPU. GPU is widely used in computing intensive tasks because of its unique architecture and powerful computing ability. According to the research on learned index in recent years, we propose a new trait of thought to combine the advantages of GPU and learned index, which puts learned index in GPU memory and makes full use of the high concurrency and computing power of GPU. We implement the PGM-index on GPU and conduct an extensive set of experiments on several real-life and synthetic datasets. The results demonstrate that our method beats the original learned index on CPU by up to 20× for static workloads when query scale is large.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122715081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Max Lübbering, Maren Pielka, Ilhamcengiz Henk, R. Sifa. "Datastack: Unification of Heterogeneous Machine Learning Dataset Interfaces." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00014

Abstract: Machine learning (ML) dataset preprocessing, cleaning, and integration into ML pipelines is often a cumbersome endeavor that is susceptible to bugs and leads to unstructured code from the start. While existing frameworks for dataset integration often come with an extensive dataset repository, extending these repositories to new datasets is nontrivial due to the lack of separation between dataset retrieval, processing, and iteration. To simplify dataset integration, we present Datastack, an open-source framework that minimizes these efforts by providing well-defined interfaces that integrate seamlessly into existing machine learning frameworks. Inspired by stream processing frameworks such as Flink and Storm, Datastack decouples dataset-specific peculiarities, such as custom data formats, from the framework by introducing byte streams at the interface level. Furthermore, Datastack provides dataset preprocessing functionality such as stacking, splitting, and merging to reduce the need for error-prone custom data processing pipelines.
{"title":"Sample-based Kernel Structure Learning with Deep Neural Networks for Automated Structure Discovery","authors":"Alexander Grass, Till Döhmen, C. Beecks","doi":"10.1109/icdew55742.2022.00017","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00017","url":null,"abstract":"Time series are prominent in a broad variety of application domains. Given a time series, how to automatically derive its inherent structure? While Gaussian process models can describe structure characteristics by their individual exploitation of covariance functions, their inference is still a computationally complex task. State-of-the-art methods therefore aim to efficiently infer an interpretable model by searching appropriate kernel compositions associated with a high-dimensional hyperparameter space. In this work, we propose a new alternative approach to learn structural components of a time series directly without inference. To this end we train a deep neural network based on kernel-induced samples, in order to obtain a generalized model for the estimation of kernel compositions. Our investigations show that our proposed approach is able to effectively classify kernel compositions of random time series data as well as estimate their hyperparameters efficiently and with high accuracy.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115321852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Georg Stefan Schlake, J. D. Hüwel, Fabian Berns, C. Beecks. "Evaluating the Lottery Ticket Hypothesis to Sparsify Neural Networks for Time Series Classification." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00015

Abstract: Reducing the complexity of deep learning models is a challenging task in many machine learning pipelines. In particular for increasingly complex data spaces, the question of how to mitigate storage efforts for large machine learning models becomes of crucial importance. The recently proposed Lottery Ticket Hypothesis is one promising approach to decreasing the size of a neural network without losing its expressiveness. While the Lottery Ticket Hypothesis has been shown to outperform other pruning methods in the field of image classification, it has not yet been extensively investigated in the domain of time series. In this paper, we thus investigate this hypothesis for the task of time series classification and empirically show that different deep learning architectures can be compressed by large factors without sacrificing expressiveness.
Theodoros Toliopoulos, A. Michailidou, A. Gounaris. "Data placement in dynamic fog ecosystems." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00009

Abstract: Dynamic data placement in distributed fog databases requires different mechanisms from those typically found in modern cloud-hosted storage solutions, in order to account for node instability and for the latency of collecting results from queries that run across multiple sites. In this work, we examine two dynamic data placement policies. The first is based on fog node stability metadata, while the second is driven by the analytic applications running on top of the distributed storage, taking into account latency, data freshness, and quality objectives. Both policies are enabled through extensions to Apache Ignite.
Andra Ionescu, Rihan Hai, Marios Fragkoulis, Asterios Katsifodimos. "Join Path-Based Data Augmentation for Decision Trees." 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), May 2022. DOI: 10.1109/icdew55742.2022.00018

Abstract: Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the richness of dataset relationships. With relational data, the challenge lies in identifying the join paths that best augment a feature table to increase the performance of a model. In this paper, we propose a two-step, automated data augmentation approach for relational data that involves: (i) enumerating join paths of various lengths given a base table, and (ii) ranking the join paths using filter methods for feature selection. We show that our approach can improve prediction accuracy and reduce runtime compared to the baseline approach.
{"title":"Anatomy of Learned Database Tuning with Bayesian Optimization","authors":"George-Octavian Barbulescu, P. Triantafillou","doi":"10.1109/icdew55742.2022.00006","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00006","url":null,"abstract":"Database Management System (DBMS) tuning is central to the performance of the end-to-end database system. DBMSs are typically characterised by hundreds of configuration knobs that impact various facets of their behavior and planning abilities. Tuning such a system is a prohibitively-challenging task due to the obfuscated knob inter-dependencies and the intimidating size of the design space. The general vendor recommendation is to sequentially tune each knob, which further exacerbates the time-consuming nature of the task. To overcome this, recent work in the realm of self-driving database systems proxy the design problem through Machine Learning. Among the most prominent proxies in self-managing databases literature is the Bayesian-inference proxy. The purpose of this proxy, or surrogate in Bayesian Optimisation parlance, is to learn the inter-knob relationships and how they relate to the overall performance, independent of any human guidance. To this end, one of the goals of this work is to shed light on the common design patterns we identify in Bayesian-driven DBMS tuning agents. Second of all, we aim to provide a handbook for implementing such agents through the lens of a new tuning framework that leverages a multi-regression proxy.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"1225 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132028855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring System and Machine Learning Performance Interactions when Tuning Distributed Data Stream Applications","authors":"Lambros Odysseos, H. Herodotou","doi":"10.1109/icdew55742.2022.00008","DOIUrl":"https://doi.org/10.1109/icdew55742.2022.00008","url":null,"abstract":"Deploying machine learning (ML) applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs have a vast array of system configuration parameters (such as degree of parallelism, memory buffer sizes, etc.) that need to be optimized to achieve the desired levels of latency and/or throughput. Second, each ML model has its own set of hyper-parameters that need to be tuned as they significantly impact the overall prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature but only in isolation from each other. This position paper identifies the necessity for a combined system and ML model tuning approach based on a thorough experimental study. In particular, experimental results have revealed unexpected and complex interactions between the choices of system configuration and hyper-parameters, and their impact on both application and model performance. These findings open up new research directions in the field of self-managing stream processing systems.","PeriodicalId":429378,"journal":{"name":"2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125031144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}