{"title":"Efficient Approximate Algorithms for Empirical Entropy and Mutual Information","authors":"Xingguang Chen, Sibo Wang","doi":"10.1145/3448016.3457255","DOIUrl":"https://doi.org/10.1145/3448016.3457255","url":null,"abstract":"Empirical entropy is a classic concept in data mining and the foundation of many other important concepts like mutual information. However, computing the exact empirical entropy/mutual information on large datasets can be expensive. Some recent research work explores sampling techniques on the empirical entropy/mutual information to speed up the top-k and filtering queries. However, their solution still aims to return the exact answers to the queries, resulting in high computational costs. Motivated by this, in this work, we present approximate algorithms for the top-k queries and filtering queries on empirical entropy and empirical mutual information. The approximate algorithm allows user-specified tunable parameters to control the trade-off between the query efficiency and accuracy. We design effective stopping rules to return the approximate answers with improved query time. We further present theoretical analysis and show that our proposed solutions achieve improved time complexity over previous solutions. We experimentally evaluate our proposed algorithms on real datasets with up to 31M records and 179 attributes. Our experimental results show that the proposed algorithm consistently outperforms the state of the art in terms of computational efficiency, by an order of magnitude in most cases, while providing the same accurate result.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"13 4-5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116859519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QuiCK: A Queuing System in CloudKit","authors":"Kfir Lev-Ari, Yizuo Tian, A. Shraer, C. Douglas, Hao Fu, Andrey Andreev, Kevin Beranek, Scott Dugas, Alec Grieser, Jeremy Hemmo","doi":"10.1145/3448016.3457567","DOIUrl":"https://doi.org/10.1145/3448016.3457567","url":null,"abstract":"We present QuiCK, a queuing system built for managing asynchronous tasks in CloudKit, Apple's storage backend service. QuiCK stores queued messages along with user data in CloudKit, and supports CloudKit's tenancy model including isolation, fair resource allocation, observability, and tenant migration. QuiCK is built on the FoundationDB Record Layer, an open source transactional DBMS. It employs massive two-level sharding, with tens of billions of queues on the first level (separately storing the queued items for each user of every CloudKit app), and hundreds of queues on a second level (one per FoundationDB cluster used by CloudKit). Our evaluation demonstrates that QuiCK scales linearly with additional consumer resources, effectively avoids contention, provides fairness across CloudKit tenants, and executes deferred tasks with low latency.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128482081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-Parallel Model Selection for Deep Learning Systems","authors":"Kabir Nagrecha","doi":"10.1145/3448016.3450571","DOIUrl":"https://doi.org/10.1145/3448016.3450571","url":null,"abstract":"As deep learning becomes more expensive, both in terms of time and compute, inefficiencies in machine learning training prevent practical usage of state-of-the-art models for most users. The newest model architectures are simply too large to be fit onto a single processor. To address the issue, many ML practitioners have turned to model parallelism as a method of distributing the computational requirements across several devices. Unfortunately, the sequential nature of neural networks causes very low efficiency and device utilization in model parallel training jobs. We propose a new form of \"shard parallelism\" combining task parallelism and model parallelism, and package it into a framework we name Hydra. Hydra recasts the problem of model parallelism in the multi-model context to produce a fine-grained parallel workload of independent model shards, rather than independent models. This new parallel design promises dramatic speedups relative to the traditional model parallelism paradigm.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128383522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Not your Grandpa's SSD: The Era of Co-Designed Storage Devices","authors":"Alberto Lerner, Philippe Bonnet","doi":"10.1145/3448016.3457540","DOIUrl":"https://doi.org/10.1145/3448016.3457540","url":null,"abstract":"Gone is the time when a Solid-State Drive (SSD) was just a fast drop-in replacement for a Hard-Disk Drive (HDD). Thanks to the NVMe ecosystem, nowadays, SSDs are accessed through specific interfaces and modern I/O frameworks. SSDs have also grown versatile with time and can now support various use cases ranging from cold, high-density storage to hot, low-latency ones. The body of knowledge about building such different devices is mostly available, but it is less than accessible to non-experts. Finding which device variation can better support a given workload also requires deep domain knowledge. This tutorial's first goal is to make these tasks--understanding the design of SSDs and pairing them with the data-intensive workloads they support well--more inviting. The tutorial goes further, however, in that it suggests that a new kind of SSD plays an essential role in post-Moore computer systems. These devices can be co-designed to align their capabilities to an application's requirements. A salient feature of these devices is that they can run application logic besides just storing data. They can thus gracefully scale processing capabilities with the volume of data stored. The tutorial's second goal is thus to establish the design space for co-designed SSDs and show the tools available to hardware, systems, and databases researchers that wish to explore this space.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124563207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Versatile Equivalences: Speeding up Subgraph Query Processing and Subgraph Matching","authors":"Hyunjoon Kim, Yunyoung Choi, Kunsoo Park, Xuemin Lin, Seok-Hee Hong, Wook-Shin Han","doi":"10.1145/3448016.3457265","DOIUrl":"https://doi.org/10.1145/3448016.3457265","url":null,"abstract":"Subgraph query processing (also known as subgraph search) and subgraph matching are fundamental graph problems in many application domains. A lot of efforts have been made to develop practical solutions for these problems. Despite the efforts, existing algorithms showed limited running time and scalability in dealing with large and/or many graphs. In this paper, we propose a new subgraph search algorithm using equivalences of vertices in order to reduce search space: (1) static equivalence of vertices in a query graph that leads to an efficient matching order of the vertices, and (2) dynamic equivalence of candidate vertices in a data graph, which enables us to capture and remove redundancies in search space. These techniques for subgraph search also lead to an improved algorithm for subgraph matching. Experiments show that our approach outperforms state-of-the-art subgraph search and subgraph matching algorithms by up to several orders of magnitude with respect to query processing time.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123309853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PyExplore: Query Recommendations for Data Exploration without Query Logs","authors":"Apostolos Glenis, G. Koutrika","doi":"10.1145/3448016.3452762","DOIUrl":"https://doi.org/10.1145/3448016.3452762","url":null,"abstract":"Helping users explore data becomes increasingly more important as databases get larger and more complex. In this demo, we present PyExplore, a data exploration tool aimed at helping end users formulate queries over new datasets. PyExplore takes as input an initial query from the user along with some parameters and provides interesting queries by leveraging data correlations and diversity.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121390902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grouped Learning: Group-By Model Selection Workloads","authors":"Side Li","doi":"10.1145/3448016.3450576","DOIUrl":"https://doi.org/10.1145/3448016.3450576","url":null,"abstract":"Machine Learning (ML) is gaining popularity in many applications. Increasingly, companies prefer more targeted models for different subgroups of the population like locations, which helps improve accuracy. This practice is comparable to Group-By aggregation in SQL; we call it learning over groups. A smaller group means the data distribution is more straightforward than the whole population. So, a group-level model may offer more accuracy in many cases. Non-technical business needs, such as privacy and regulatory compliance, may also necessitate group-level models. For instance, online advertising platforms would need to build disaggregated partner-specific ML models, where all partner groups' training data are aggregated together in one data pipeline.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124093627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TardisDB","authors":"Maximilian E. Schüle, Josef Schmeißer, T. Blum, Alfons Kemper, Thomas Neumann","doi":"10.1145/3448016.3452767","DOIUrl":"https://doi.org/10.1145/3448016.3452767","url":null,"abstract":"Online encyclopaedias such as Wikipedia implement their own version control above database systems to manage multiple revisions of the same page. In contrast to temporal databases that restrict each tuple's validity to a time range, a version affects multiple tuples. To overcome the need for a separate version layer, we have created TardisDB, the first database system with incorporated data versioning across multiple relations. This paper presents the interface for TardisDB with an extended SQL to manage and query data from different branches. We first give an overview of TardisDB's architecture that includes an extended table scan operator: a branch bitmap indicates a tuple's affiliation to a branch and a chain of tuples tracks the different versions. This is the first database system that combines chains for multiversion concurrency control with a bitmap for each branch to enable versioning. Afterwards, we describe our proposed SQL extension to create, query and modify tables across different, named branches. In our demonstration setup, we allow users to interactively create and edit branches and display the lineage of each branch.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115740928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases","authors":"Xinyi Zhang, Hong Wu, Zhuonan Chang, Shuowei Jin, Jian Tan, Feifei Li, Tieying Zhang, Bin Cui","doi":"10.1145/3448016.3457291","DOIUrl":"https://doi.org/10.1145/3448016.3457291","url":null,"abstract":"Modern database management systems (DBMS) contain tens to hundreds of critical performance tuning knobs that determine the system runtime behaviors. To reduce the total cost of ownership, cloud database providers put in drastic effort to automatically optimize the resource utilization by tuning these knobs. There are two challenges. First, the tuning system should always abide by the service level agreement (SLA) while optimizing the resource utilization, which imposes strict constrains on the tuning process. Second, the tuning time should be reasonably acceptable since time-consuming tuning is not practical for production and online troubleshooting. In this paper, we design ResTune to automatically optimize the resource utilization without violating SLA constraints on the throughput and latency requirements. ResTune leverages the tuning experience from the history tasks and transfers the accumulated knowledge to accelerate the tuning process of the new tasks. The prior knowledge is represented from historical tuning tasks through an ensemble model. The model learns the similarity between the historical workloads and the target, which significantly reduces the tuning time by a meta-learning based approach. ResTune can efficiently handle different workloads and various hardware environments. We perform evaluations using benchmarks and real world workloads on different types of resources. The results show that, compared with the manually tuned configurations, ResTune reduces 65%, 87%, 39% of CPU utilization, I/O and memory on average, respectively. Compared with the state-of-the-art methods, ResTune finds better configurations with up to ~18x speedups.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130029940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Optimization of Matrix Implementations for Distributed Machine Learning and Linear Algebra","authors":"Shangyu Luo, Dimitrije Jankov, Binhang Yuan, C. Jermaine","doi":"10.1145/3448016.3457317","DOIUrl":"https://doi.org/10.1145/3448016.3457317","url":null,"abstract":"Machine learning (ML) computations are often expressed using vectors, matrices, or higher-dimensional tensors. Such data structures can have many different implementations, especially in a distributed environment: a matrix could be stored as row or column vectors, tiles of different sizes, or relationally, as a set of (rowIndex, colIndex, value) triples. Many other storage formats are possible. The choice of format can have a profound impact on the performance of a ML computation. In this paper, we propose a framework for automatic optimization of the physical implementation of a complex ML or linear algebra (LA) computation in a distributed environment, develop algorithms for solving this problem, and show, through a prototype on top of a distributed relational database system, that our ideas can radically speed up common ML and LA computations.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134449711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}