Making Data Clouds Smarter at Keebo: Automated Warehouse Optimization using Data Learning
Barzan Mozafari, Radu Alexandru Burcuta, Alan Cabrera, A. Constantin, Derek Francis, David Grömling, Alekh Jindal, Maciej Konkolowicz, Valentin Marian Spac, Yongjoo Park, Russell Razo Carranzo, Nicholas M. Richardson, Abhishek Roy, Aayushi Srivastava, Isha Tarte, B. Westphal, Chi Zhang
Companion of the 2023 International Conference on Management of Data. DOI: https://doi.org/10.1145/3555041.3589681

Abstract: Data clouds in general, and cloud data warehouses (CDWs) in particular, have lowered the upfront expertise and infrastructure barriers, making it easy for a wider range of users to query large and diverse sources of data. This has made modern data pipelines more complex, harder to optimize, and therefore less resource efficient. As a result, the ongoing cost of data clouds can easily become prohibitively expensive. Further, since CDWs are general-purpose solutions that must serve a wide range of workloads, their out-of-the-box performance is sub-optimal for any single workload. Data teams therefore spend significant effort manually optimizing their queries and cloud infrastructure to curb costs while achieving reasonable performance. Aside from the opportunity cost of diverting data teams from business goals, manually optimizing millions of constantly changing queries is simply daunting. To the best of our knowledge, Keebo's Warehouse Optimization is the first fully automated solution capable of making real-time optimization decisions that minimize a CDW's overall cost while meeting the users' performance goals. Keebo learns from how users and applications interact with their CDW and uses its trained models to automatically optimize warehouse settings, adjust its resources (e.g., compute, memory), scale it up or down, suspend or resume it, and self-correct in real time based on the impact of its own actions.
Growing and Serving Large Open-domain Knowledge Graphs
I. Ilyas, JP Lacerda, Yunyao Li, U. F. Minhas, A. Mousavi, Jeffrey Pound, Theodoros Rekatsinas, C. Sumanth
Companion of the 2023 International Conference on Management of Data. DOI: https://doi.org/10.1145/3555041.3589672

Abstract: Applications of large open-domain knowledge graphs (KGs) to real-world problems pose many unique challenges. In this paper, we present extensions to Saga, our platform for continuous construction and serving of knowledge at scale. In particular, we describe a pipeline for training knowledge graph embeddings that powers key capabilities such as fact ranking, fact verification, a related-entities service, and support for entity linking. We then describe how our platform, including graph embeddings, can be leveraged to create a Semantic Annotation service that links unstructured Web documents to entities in our KG. Semantic annotation of the Web effectively expands our knowledge graph with edges to open-domain Web content, which can be used in various search and ranking problems. Next, we leverage annotated Web documents to drive Open-domain Knowledge Extraction: this targeted extraction framework identifies important coverage issues in the KG, then finds relevant data sources for target entities on the Web and extracts missing information to enrich the KG. Finally, we describe adaptations to our knowledge platform needed to construct and serve private personal knowledge on-device, including private incremental KG construction, cross-device knowledge sync, and global knowledge enrichment.
{"title":"DIALITE: Discover, Align and Integrate Open Data Tables","authors":"Aamod Khatiwada, Roee Shraga, Renée J. Miller","doi":"10.1145/3555041.3589732","DOIUrl":"https://doi.org/10.1145/3555041.3589732","url":null,"abstract":"We demonstrate a novel table discovery pipeline called DIALITE that allows users to discover, integrate and analyze open data tables. DIALITE has three main stages. First, it allows users to discover tables from open data platforms using state-of-the-art table discovery techniques. Second, DIALITE integrates the discovered tables to produce an integrated table. Finally, it allows users to analyze the integration result by applying different downstreaming tasks over it. Our pipeline is flexible such that the user can easily add and compare additional discovery and integration algorithms.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114578113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dangoron: Network Construction on Large-scale Time Series Data across Sliding Windows","authors":"Yunlong Xu, Peizhen Yang, Zhengbin Tao","doi":"10.1145/3555041.3589399","DOIUrl":"https://doi.org/10.1145/3555041.3589399","url":null,"abstract":"In complex networks, the dynamics of systems are represented through the interactions of a set of anomalous time series. A crucial problem to consider is the computation of correlations between highly correlated pairs of time series across sliding windows. The efficient calculation and updating of the correlation matrix, considering user-defined sliding periods and thresholds, are vital for facilitating large-scale time series network dynamics analysis. We present Dangoron, a framework meticulously designed for the efficient identification of highly correlated pairs of time series over sliding windows and the precise computation of their respective correlations. Dangoron predicts dynamic correlations across sliding windows and prunes unrelated time series, thereby yielding a performance at least an order of magnitude faster than a baseline approach. Additionally, we introduce Tomborg, the first benchmark specifically developed to address the challenge of correlation matrix computation in the context of time series analysis. This benchmark serves as a robust foundation for future research in this domain.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115853679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Processing with FPGAs on Modern Architectures","authors":"Wen Jiang, Dario Korolija, G. Alonso","doi":"10.1145/3555041.3589410","DOIUrl":"https://doi.org/10.1145/3555041.3589410","url":null,"abstract":"Trends in hardware, the prevalence of the cloud, and the rise of highly demanding applications have ushered an era of specialization that is quickly changing the way data is processed at scale. These changes are likely to continue and accelerate in the next years as new technologies are adopted and deployed: smart NICs, smart storage, smart memory, disaggregated storage, disaggregated memory, specialized accelerators (GPUS, TPUs, FPGAs), as well as a wealth of ASICS specifically created to deal with computationally expensive tasks (e.g., cryptography or compression). In this tutorial we focus on data processing on FPGAs, a technology that has received less attention than, e.g., TPUs or GPUs but that is, however, increasingly being deployed in the cloud for data processing tasks due to the architectural flexibility of FPGAs and their ability to process data at line rate, something not possible with other type of processors or accelerators. In the tutorial we will cover what are FPGAs, their characteristics, their advantages and disadvantages over other design options, as well as examples from deployments in industry and how they are used in a variety of data processing tasks. Then we will provide a brief introduction to FPGA programming with High Level Synthesis (HLS) tools as well as briefly describe resources available to researchers in the form of academic clusters and open source systems that simplify the first steps. The tutorial will also include several case studies borrowed from research done in collaboration with companies that illustrate both the potential of FPGAs in data processing but also how software and hardware architectures are evolving to take advantage of the possibilities offered by FPGAs.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115461664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sampling over Union of Joins","authors":"Yurong Liu, Yunlong Xu, F. Nargesian","doi":"10.1145/3555041.3589400","DOIUrl":"https://doi.org/10.1145/3555041.3589400","url":null,"abstract":"Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join and union, given a set of joins, we study the problem of obtaining a random sample from the union of joins without performing the full join and union. We present a general framework for random sampling over the set union of chain, acyclic, and cyclic joins, with sample uniformity and independence guarantees. We study the novel problem of union of joins size evaluation and propose two approximation methods based on histograms of columns and random walks on data. We propose an online union sampling framework that initializes with cheap-to-calculate parameter approximations and refines them on the fly during sampling. We evaluate our framework on workloads from the TPC-H benchmark and explore the trade-off of the accuracy of union approximation and sampling efficiency.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115932887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Main Memory Database Recovery Strategies
Arlino Magalhães, Angelo Brayner, José Maria S. Monteiro
Companion of the 2023 International Conference on Management of Data. DOI: https://doi.org/10.1145/3555041.3589402

Abstract: Most current application scenarios, such as trading, real-time bidding, advertising, weather forecasting, and social gaming, require massive real-time data processing. Main memory database systems (MMDBs) have proved to be an efficient alternative for such applications. These systems keep the primary copy of the database in main memory to achieve high throughput and low latency. However, because of memory volatility, a database in RAM is more vulnerable to failures than one in a traditional disk-oriented database. DBMSs implement recovery activities (logging, checkpointing, and restart) for recovery purposes. Although the recovery component looks similar in disk- and memory-oriented systems, the two differ dramatically in how they implement their architectural components, such as data storage, indexing, concurrency control, query processing, durability, and recovery. This tutorial aims to provide a thorough review of in-memory database recovery techniques. To this end, we first review the main concepts of database recovery and the architectural choices involved in implementing an in-memory database system. Only then do we present techniques for recovering in-memory databases and discuss the recovery strategies of a representative sample of modern in-memory databases. In addition, the tutorial presents challenges and future research directions for MMDBs, in order to provide guidance for other researchers.
{"title":"Companion of the 2023 International Conference on Management of Data","authors":"","doi":"10.1145/3555041","DOIUrl":"https://doi.org/10.1145/3555041","url":null,"abstract":"","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123630218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}