Matthias Boehm, Madelon Hulsebos, Shreya Shankar, P. Varma
{"title":"Seventh Workshop on Data Management for End-to-End Machine Learning (DEEM)","authors":"Matthias Boehm, Madelon Hulsebos, Shreya Shankar, P. Varma","doi":"10.1145/3555041.3590819","DOIUrl":"https://doi.org/10.1145/3555041.3590819","url":null,"abstract":"The DEEM'23 workshop (Data Management for End-to-End Machine Learning) is held on Sunday, June 18th, in conjunction with SIGMOD/PODS 2023. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues arising in ML application scenarios. The workshop solicits regular research papers (10 pages) describing preliminary and ongoing research results, including industrial experience reports of end-to-end ML deployments, related to DEEM topics. In addition, DEEM 2023 has a category for short papers (4 pages) as a forum for sharing interesting use cases, problems, datasets, benchmarks, visionary ideas, system designs, preliminary results, and descriptions of system components and tools related to end-to-end ML pipelines. The workshop received 13 high-quality submissions on diverse topics relevant to DEEM, of which 6 are regular papers and 7 are short papers.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114982867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Framework for Data Pipeline Discovery","authors":"D. Benvenuti","doi":"10.1145/3555041.3589395","DOIUrl":"https://doi.org/10.1145/3555041.3589395","url":null,"abstract":"With the recent developments of Internet of Things (IoT) and cloud-based technologies, massive amounts of data are generated by heterogeneous sources and stored through dedicated cloud solutions. Often organizations generate much more data than they are able to interpret, and current Cloud Computing technologies cannot fully meet the requirements of the Big Data processing applications and their data transfer overheads [3]. Many data are stored for compliance purposes only but not turned into value, thus becoming Dark Data, which are not only an unused value but also pose a risk for organizations [7, 18].","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115095007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jannis Becktepe, Mahdi Esmailoghli, Maximilian Koch, Ziawasch Abedjan
{"title":"Demonstrating MATE and COCOA for Data Discovery","authors":"Jannis Becktepe, Mahdi Esmailoghli, Maximilian Koch, Ziawasch Abedjan","doi":"10.1145/3555041.3589716","DOIUrl":"https://doi.org/10.1145/3555041.3589716","url":null,"abstract":"One of the common use cases for data discovery is to enrich a given table with additional columns from related tables inside a data lake. We have recently introduced MATE and COCOA, two systems for joinability discovery and correlation calculation, respectively. By leveraging two novel index structures, a hash-based Super Key Index, and an Order Index, our system is capable of efficiently identifying tables that join on multiple columns and contain relevant features. We show how the data exploration and enrichment process benefits from our index structures by demonstrating MaCo, a unified system on top of open web and large table corpora.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116346293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rogers Jeffrey Leo John, Dylan Bacon, Junda Chen, Ushmal Ramesh, Jiatong Li, Deepan Das, R. Claus, Amos Kendall, Jignesh M. Patel
{"title":"DataChat: An Intuitive and Collaborative Data Analytics Platform","authors":"Rogers Jeffrey Leo John, Dylan Bacon, Junda Chen, Ushmal Ramesh, Jiatong Li, Deepan Das, R. Claus, Amos Kendall, Jignesh M. Patel","doi":"10.1145/3555041.3589678","DOIUrl":"https://doi.org/10.1145/3555041.3589678","url":null,"abstract":"Enterprises invest in data platforms with the aim of extracting meaningful information through analytics. Typically, experts create analytics pipelines that feed into dashboards and provide answers to predetermined questions. This approach makes analytics a spectator sport for most people and introduces operational bottlenecks to leveraging those investments. To improve the value derived from data, many organizations are opting to open up their data assets and allow access to a wider range of users. However, using programming languages such as SQL and Python for analytics can be difficult for most enterprise users. DataChat provides a simplified data science approach that is intuitive, powerful, and accessible to all data users. The platform is built on a library of data functions that are cleanly abstracted to maximize efficiency and ease of use while maintaining a rich suite of tools necessary for data science. With these functions, users can create data analysis pipelines by using a simple point-and-click interface in a spreadsheet view or by using natural English interfaces. Modern sharing and collaboration features are central to all aspects of the platform, allowing teams to easily bridge expertise gaps. A deeper understanding of results is facilitated by providing automatically-generated English explanations of how they were derived. By enhancing these aspects of data science and human-to-human communication, the platform addresses the needs that many organizations are encountering as their analytics needs mature.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"1176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116481453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fotis Psallidas, Megan Leszczynski, M. Namaki, A. Floratou, Ashvin Agrawal, Konstantinos Karanasos, Subru Krishnan, Pavle Subotic, Markus Weimer, Yinghui Wu, Yiwen Zhu
{"title":"Demonstration of Geyser: Provenance Extraction and Applications over Data Science Scripts","authors":"Fotis Psallidas, Megan Leszczynski, M. Namaki, A. Floratou, Ashvin Agrawal, Konstantinos Karanasos, Subru Krishnan, Pavle Subotic, Markus Weimer, Yinghui Wu, Yiwen Zhu","doi":"10.1145/3555041.3589717","DOIUrl":"https://doi.org/10.1145/3555041.3589717","url":null,"abstract":"As enterprises have started developing and deploying complicated data science workloads at scale, the need for mechanisms that enable enterprise-grade data science (e.g., compliance or auditing) has become more pronounced. In this paper, we present Geyser, an extensible provenance system for data science workloads that can be used as a foundation for enterprise-grade data science. Our system supports both static and dynamic provenance, over a wide range of data science scripts, driven by a knowledge base of data science APIs. We demonstrate the wide applicability of the system using various industrial applications: provenance extraction, model compliance, model linting, model versioning, and poisoning detection. A video of the demonstration is available at https://aka.ms/geyserdemo.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125960802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARENA: Alternative Relational Query Plan Exploration for Database Education","authors":"Hu Wang, Hui Li, S. Bhowmick, Baochao Xu","doi":"10.1145/3555041.3589713","DOIUrl":"https://doi.org/10.1145/3555041.3589713","url":null,"abstract":"A key learning goal of learners taking a database systems course is to understand how SQL queries are processed in an RDBMS in practice. To this end, comprehension of different alternative query plans (AQPs) that may be considered during the selection of the query execution plan (QEP) of a query is paramount. In this demonstration, we present a novel and generic system called ARENA that facilitates exploration of informative alternative query plans of a given SQL query to aid the comprehension of QEP selection. Under the hood, ARENA addresses a novel problem called informative plan selection problem (TIPS) which aims to discover alternative plans from the underlying plan space so that the plan informativeness is maximized. We demonstrate various innovative features of ARENA emphasizing the important role it can play in supplementing database education.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125905707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bhavya Chopra, Anna Fariha, Sumit Gulwani, Austin Z. Henley, Daniel Perelman, Mohammad Raza, Sherry Shi, D. Simmons, Ashish Tiwari
{"title":"CoWrangler: Recommender System for Data-Wrangling Scripts","authors":"Bhavya Chopra, Anna Fariha, Sumit Gulwani, Austin Z. Henley, Daniel Perelman, Mohammad Raza, Sherry Shi, D. Simmons, Ashish Tiwari","doi":"10.1145/3555041.3589722","DOIUrl":"https://doi.org/10.1145/3555041.3589722","url":null,"abstract":"We present CoWrangler, a real-time data wrangling recommender system, which can recommend the next-best data wrangling operations along with the corresponding human-readable and efficient code snippets to expedite data exploration and wrangling efforts. A key feature of CoWrangler is that it provides explanations for the generated suggestions in the form of data insights, allowing the user to place confidence in the system. Under the hood, CoWrangler relies on intelligent generation of candidate suggestions using program synthesis techniques and ranking of a set of suggestions based on the notion of data quality improvement. We demonstrate how CoWrangler provides a human-in-the-loop data wrangling experience, and helps users make informed data pre-processing decisions, while saving their time and effort.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122364988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Madelon Hulsebos, Xiang Deng, Huan Sun, Paolo Papotti
{"title":"Models and Practice of Neural Table Representations","authors":"Madelon Hulsebos, Xiang Deng, Huan Sun, Paolo Papotti","doi":"10.1145/3555041.3589411","DOIUrl":"https://doi.org/10.1145/3555041.3589411","url":null,"abstract":"In the last few years, the natural language processing community witnessed advances in neural representations of free-form text with transformer-based language models (LMs). Given the importance of knowledge available in relational tables, recent research efforts extend LMs by developing neural representations for tabular data. In this tutorial, we present these proposals with three main goals. First, we aim at introducing the potentials and limitations of current models to a database audience. Second, we want the attendees to see the benefit of such line of work in a large variety of data applications. Third, we would like to empower the audience with a new set of tools and to inspire them to tackle some of the important directions for neural table representations, including model and system design, evaluation, application and deployment. To achieve these goals, the tutorial is organized in two parts. The first part covers the background for neural table representations, including a survey of the most important systems. The second part is designed as a hands-on session, where attendees will use their laptop to explore this new framework and test neural models involving text and tabular data.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128030397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hesam Shahrokhi, Callum Groeger, Yizhuo Yang, A. Shaikhha
{"title":"Efficient Query Processing in Python Using Compilation","authors":"Hesam Shahrokhi, Callum Groeger, Yizhuo Yang, A. Shaikhha","doi":"10.1145/3555041.3589735","DOIUrl":"https://doi.org/10.1145/3555041.3589735","url":null,"abstract":"In this paper, we present a framework for efficient query processing in Python. Inspired by the increasing interest in Python-based frameworks such as TensorFlow and Pandas for data scientists, our framework consists of three different input languages. The first language is SQL; to better integrate the SQL queries with the rest of the data science pipeline, by relying on off-the-shelf query optimizers (e.g., PostgreSQL) the SQL code is translated to a physical query plan, which is in turn translated to Pandas code. The second input is Pandas code; it can be either run by Pandas itself or alternatively be translated into SDQL.py, the third input language that can be translated into efficient low-level code and can achieve an order-of-magnitude performance improvement over Pandas. Our framework exposes a Python-based API that allows data scientists to use SDQL.py as a pure Python library.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131936259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huidong Zhang, Luyi Qu, Qingshuai Wang, Rong Zhang, Peng Cai, Quanqing Xu, Zhifeng Yang, Chuanhui Yang
{"title":"Dike: A Benchmark Suite for Distributed Transactional Databases","authors":"Huidong Zhang, Luyi Qu, Qingshuai Wang, Rong Zhang, Peng Cai, Quanqing Xu, Zhifeng Yang, Chuanhui Yang","doi":"10.1145/3555041.3589710","DOIUrl":"https://doi.org/10.1145/3555041.3589710","url":null,"abstract":"Distributed relational database management systems (abbr. DDBMSs) for online transaction processing (abbr. OLTP) have been gradually adopted in production environments. With many relevant products vying for the market, an unbiased benchmark is urgently needed to promote the development of transactional DDBMSs. Current benchmarks for OLTP applications do not take into consideration the challenges encountered during the design and implementation of a transactional DDBMS, which is expected to provide high elasticity and availability as well as high throughput. We propose a benchmark suite, Dike, to evaluate the efforts to tackle these challenges. Dike is designed mainly from three aspects: quantitative control to evaluate scalability, imbalanced distribution to evaluate schedulability, and comprehensive fault injections to evaluate availability. It also provides a dynamic load control to simulate real-world scenarios. In this demonstration, users can experience core features of Dike with user-friendly interfaces.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122604635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}