Proceedings of the 2021 International Conference on Management of Data: Latest Publications

Efficient Approximate Algorithms for Empirical Entropy and Mutual Information
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3457255
Xingguang Chen, Sibo Wang
Abstract: Empirical entropy is a classic concept in data mining and the foundation of many other important concepts such as mutual information. However, computing the exact empirical entropy or mutual information on large datasets can be expensive. Recent work explores sampling techniques for empirical entropy and mutual information to speed up top-k and filtering queries, but these solutions still aim to return exact answers, resulting in high computational costs. Motivated by this, we present approximate algorithms for top-k and filtering queries on empirical entropy and empirical mutual information. The approximate algorithms allow user-specified tunable parameters to control the trade-off between query efficiency and accuracy. We design effective stopping rules to return approximate answers with improved query time. We further present a theoretical analysis and show that our proposed solutions achieve improved time complexity over previous solutions. We experimentally evaluate our algorithms on real datasets with up to 31M records and 179 attributes. The results show that the proposed algorithms consistently outperform the state of the art in computational efficiency, by an order of magnitude in most cases, while providing equally accurate results.
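The abstract does not spell out the estimator, but the general idea of trading a little accuracy for speed when computing empirical entropy can be sketched in a few lines. Everything below (uniform sampling with replacement, the eps tolerance, the batch-wise stopping rule) is an illustrative assumption rather than the paper's algorithm:

```python
# Minimal sketch: exact empirical entropy of a column vs. a naive sampled
# estimate that stops once successive estimates change by less than eps.
import math
import random
from collections import Counter

def empirical_entropy(values):
    """Exact empirical entropy H = -sum_v p_v * log2(p_v) over observed values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def approx_entropy(values, eps=0.01, batch=1000, seed=0):
    """Entropy of a growing random sample; stops once the estimate stabilizes."""
    rng = random.Random(seed)
    sample, prev = [], None
    while len(sample) < len(values):
        sample.extend(rng.choice(values) for _ in range(batch))
        est = empirical_entropy(sample)
        if prev is not None and abs(est - prev) < eps:
            return est
        prev = est
    return empirical_entropy(values)  # fall back to the exact answer

if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.randint(0, 50) for _ in range(200_000)]
    print("exact :", round(empirical_entropy(data), 4))
    print("approx:", round(approx_entropy(data), 4))
```

On a large, roughly uniform column the sampled estimate typically stabilizes after seeing only a small fraction of the data, which is the effect the eps/efficiency trade-off is meant to illustrate.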
Citations: 3
QuiCK: A Queuing System in CloudKit
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3457567
Kfir Lev-Ari, Yizuo Tian, A. Shraer, C. Douglas, Hao Fu, Andrey Andreev, Kevin Beranek, Scott Dugas, Alec Grieser, Jeremy Hemmo
Abstract: We present QuiCK, a queuing system built for managing asynchronous tasks in CloudKit, Apple's storage backend service. QuiCK stores queued messages along with user data in CloudKit, and supports CloudKit's tenancy model including isolation, fair resource allocation, observability, and tenant migration. QuiCK is built on the FoundationDB Record Layer, an open-source transactional DBMS. It employs massive two-level sharding, with tens of billions of queues on the first level (separately storing the queued items for each user of every CloudKit app) and hundreds of queues on the second level (one per FoundationDB cluster used by CloudKit). Our evaluation demonstrates that QuiCK scales linearly with additional consumer resources, effectively avoids contention, provides fairness across CloudKit tenants, and executes deferred tasks with low latency.
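A hypothetical sketch of the per-tenant queuing idea described above: one FIFO per (app, user) tenant on the first level and round-robin draining across tenants for fairness. The key layout, class names, and draining policy are assumptions made for illustration only, not QuiCK's actual design on the FoundationDB Record Layer:

```python
# Toy in-memory model of per-tenant queues with fair draining across tenants.
from collections import defaultdict, deque

class TenantQueues:
    def __init__(self):
        self._queues = defaultdict(deque)   # (app, user) -> FIFO of tasks

    def enqueue(self, app, user, task):
        self._queues[(app, user)].append(task)

    def drain_fairly(self, budget):
        """Consume up to `budget` tasks, at most one per tenant per round."""
        done = []
        while budget > 0 and any(self._queues.values()):
            for key in list(self._queues):
                if budget == 0:
                    break
                if self._queues[key]:
                    done.append((key, self._queues[key].popleft()))
                    budget -= 1
        return done

if __name__ == "__main__":
    q = TenantQueues()
    for i in range(3):
        q.enqueue("photos", "alice", f"sync-{i}")
    q.enqueue("notes", "bob", "index-0")
    print(q.drain_fairly(budget=3))   # consumption alternates across tenants
```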
Citations: 2
Model-Parallel Model Selection for Deep Learning Systems
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3450571
Kabir Nagrecha
Abstract: As deep learning becomes more expensive, in terms of both time and compute, inefficiencies in machine learning training prevent practical use of state-of-the-art models for most users. The newest model architectures are simply too large to fit onto a single processor. To address this issue, many ML practitioners have turned to model parallelism as a method of distributing the computational requirements across several devices. Unfortunately, the sequential nature of neural networks causes very low efficiency and device utilization in model-parallel training jobs. We propose a new form of "shard parallelism" combining task parallelism and model parallelism, and package it into a framework we name Hydra. Hydra recasts the problem of model parallelism in the multi-model context to produce a fine-grained parallel workload of independent model shards, rather than independent models. This new parallel design promises dramatic speedups relative to the traditional model parallelism paradigm.
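A toy sketch of the shard-parallel idea: flatten a multi-model selection workload into independent (model, shard) tasks and spread them across devices. It deliberately ignores the within-model sequential dependencies a real system must respect, and all names and the round-robin policy are illustrative assumptions rather than Hydra's implementation:

```python
# Flatten a multi-model workload into shard tasks, then assign them to devices.
from collections import deque

def make_shards(model_ids, shards_per_model):
    """One independent task per (model, shard) instead of one per model."""
    return deque((m, s) for m in model_ids for s in range(shards_per_model))

def schedule(tasks, devices):
    """Greedily assign shard tasks round-robin to devices."""
    plan = {d: [] for d in devices}
    i = 0
    while tasks:
        plan[devices[i % len(devices)]].append(tasks.popleft())
        i += 1
    return plan

if __name__ == "__main__":
    tasks = make_shards(model_ids=["cfg_a", "cfg_b", "cfg_c"], shards_per_model=4)
    for device, assigned in schedule(tasks, devices=["gpu0", "gpu1"]).items():
        print(device, assigned)
```

The point of the fine-grained workload is that shards from different models can fill the idle slots that a single model's sequential forward/backward pass would otherwise leave on each device.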
Citations: 12
Not your Grandpa's SSD: The Era of Co-Designed Storage Devices
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3457540
Alberto Lerner, Philippe Bonnet
Abstract: Gone is the time when a Solid-State Drive (SSD) was just a fast drop-in replacement for a Hard-Disk Drive (HDD). Thanks to the NVMe ecosystem, SSDs are nowadays accessed through specific interfaces and modern I/O frameworks. SSDs have also grown versatile with time and can now support various use cases, ranging from cold, high-density storage to hot, low-latency ones. The body of knowledge about building such different devices is mostly available, but it is less than accessible to non-experts. Finding which device variation can better support a given workload also requires deep domain knowledge. This tutorial's first goal is to make these tasks (understanding the design of SSDs and pairing them with the data-intensive workloads they support well) more inviting. The tutorial goes further, however, in that it suggests that a new kind of SSD plays an essential role in post-Moore computer systems. These devices can be co-designed to align their capabilities with an application's requirements. A salient feature of these devices is that they can run application logic besides just storing data. They can thus gracefully scale processing capabilities with the volume of data stored. The tutorial's second goal is thus to establish the design space for co-designed SSDs and show the tools available to hardware, systems, and database researchers who wish to explore this space.
Citations: 14
Versatile Equivalences: Speeding up Subgraph Query Processing and Subgraph Matching
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3457265
Hyunjoon Kim, Yunyoung Choi, Kunsoo Park, Xuemin Lin, Seok-Hee Hong, Wook-Shin Han
Abstract: Subgraph query processing (also known as subgraph search) and subgraph matching are fundamental graph problems in many application domains. A lot of effort has been made to develop practical solutions for these problems. Despite these efforts, existing algorithms show limited running time and scalability when dealing with large and/or many graphs. In this paper, we propose a new subgraph search algorithm that uses equivalences of vertices to reduce the search space: (1) static equivalence of vertices in a query graph, which leads to an efficient matching order of the vertices, and (2) dynamic equivalence of candidate vertices in a data graph, which enables us to capture and remove redundancies in the search space. These techniques for subgraph search also lead to an improved algorithm for subgraph matching. Experiments show that our approach outperforms state-of-the-art subgraph search and subgraph matching algorithms by up to several orders of magnitude in query processing time.
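The first technique, static equivalence of query vertices, can be made concrete with a small sketch. The grouping key used here (same label and same neighbor set) is a common notion of vertex equivalence chosen for illustration; the paper's precise definition, and its dynamic counterpart on the data graph, are not reproduced:

```python
# Group query-graph vertices into static equivalence classes by
# (label, neighbor set). Equivalent vertices can share candidate sets and a
# position in the matching order, shrinking the search space.
from collections import defaultdict

def static_equivalence_classes(labels, adj):
    """labels: vertex -> label; adj: vertex -> set of neighbor vertices."""
    classes = defaultdict(list)
    for v, lab in labels.items():
        key = (lab, frozenset(adj[v]))
        classes[key].append(v)
    return [sorted(group) for group in classes.values()]

if __name__ == "__main__":
    # a(L1) -- c(L2) -- b(L1): a and b are structurally interchangeable.
    labels = {"a": "L1", "b": "L1", "c": "L2"}
    adj = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
    print(static_equivalence_classes(labels, adj))  # [['a', 'b'], ['c']]
```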
Citations: 25
PyExplore: Query Recommendations for Data Exploration without Query Logs
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3452762
Apostolos Glenis, G. Koutrika
Abstract: Helping users explore data becomes increasingly important as databases get larger and more complex. In this demo, we present PyExplore, a data exploration tool aimed at helping end users formulate queries over new datasets. PyExplore takes an initial query from the user, along with some parameters, and provides interesting queries by leveraging data correlations and diversity.
Citations: 4
Grouped Learning: Group-By Model Selection Workloads
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3450576
Side Li
Abstract: Machine Learning (ML) is gaining popularity in many applications. Increasingly, companies prefer more targeted models for different subgroups of the population, such as locations, which helps improve accuracy. This practice is comparable to Group-By aggregation in SQL; we call it learning over groups. A smaller group's data distribution is usually simpler than that of the whole population, so a group-level model may offer higher accuracy in many cases. Non-technical business needs, such as privacy and regulatory compliance, may also necessitate group-level models. For instance, online advertising platforms whose data pipelines aggregate all partner groups' training data together would need to build disaggregated, partner-specific ML models.
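The workload described above (one model per group, analogous to a SQL Group-By) looks roughly like the following sketch; the column names, the synthetic data, and the choice of logistic regression are illustrative assumptions, not the system proposed in the paper:

```python
# Minimal sketch of "learning over groups": fit one model per group key.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "location": rng.choice(["NYC", "SF", "LA"], size=600),
    "x1": rng.normal(size=600),
    "x2": rng.normal(size=600),
})
# Give each location its own label function, so per-group models pay off.
df["y"] = (df["x1"] * df["location"].map({"NYC": 1, "SF": -1, "LA": 0.5})
           + 0.1 * rng.normal(size=600) > 0).astype(int)

models = {}
for loc, part in df.groupby("location"):          # GROUP BY location
    clf = LogisticRegression().fit(part[["x1", "x2"]], part["y"])
    models[loc] = clf
    print(loc, "train accuracy:", round(clf.score(part[["x1", "x2"]], part["y"]), 3))
```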
Citations: 0
TardisDB
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3452767
Maximilian E. Schüle, Josef Schmeißer, T. Blum, Alfons Kemper, Thomas Neumann
Abstract: Online encyclopaedias such as Wikipedia implement their own version control on top of database systems to manage multiple revisions of the same page. In contrast to temporal databases, which restrict each tuple's validity to a time range, a version here affects multiple tuples. To overcome the need for a separate version layer, we have created TardisDB, the first database system with incorporated data versioning across multiple relations. This paper presents the interface for TardisDB with an extended SQL to manage and query data from different branches. We first give an overview of TardisDB's architecture, which includes an extended table-scan operator: a branch bitmap indicates a tuple's affiliation with a branch, and a chain of tuples tracks the different versions. This is the first database system that combines chains for multi-version concurrency control with a bitmap for each branch to enable versioning. Afterwards, we describe our proposed SQL extension to create, query, and modify tables across different, named branches. In our demonstration setup, users can interactively create and edit branches and display the lineage of each branch.
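The two structures named in the abstract, a per-tuple branch bitmap and a chain of tuple versions, can be sketched as follows. The in-memory representation (a Python set standing in for the bitmap, a linked chain of version objects) is an illustrative assumption, not TardisDB's storage layout or its SQL syntax:

```python
# Toy model: each tuple version records which branches it is visible in, and
# versions of the same logical tuple form a chain from newest to oldest.
class TupleVersion:
    def __init__(self, payload, branches, prev=None):
        self.payload = payload          # the tuple's attribute values
        self.branches = set(branches)   # "branch bitmap": visibility per branch
        self.prev = prev                # older version in the chain

def visible(head, branch):
    """Walk the version chain and return the newest version on `branch`."""
    v = head
    while v is not None:
        if branch in v.branches:
            return v.payload
        v = v.prev
    return None

if __name__ == "__main__":
    v1 = TupleVersion({"page": "Tardis", "text": "draft"}, {"master"})
    # An edit on branch "fix-typos" creates a new head visible only there.
    v2 = TupleVersion({"page": "Tardis", "text": "final"}, {"fix-typos"}, prev=v1)
    print(visible(v2, "master"))     # {'page': 'Tardis', 'text': 'draft'}
    print(visible(v2, "fix-typos"))  # {'page': 'Tardis', 'text': 'final'}
```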
Citations: 7
ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3457291
Xinyi Zhang, Hong Wu, Zhuonan Chang, Shuowei Jin, Jian Tan, Feifei Li, Tieying Zhang, Bin Cui
Abstract: Modern database management systems (DBMSs) contain tens to hundreds of critical performance tuning knobs that determine the system's runtime behavior. To reduce the total cost of ownership, cloud database providers put considerable effort into automatically optimizing resource utilization by tuning these knobs. There are two challenges. First, the tuning system should always abide by the service level agreement (SLA) while optimizing resource utilization, which imposes strict constraints on the tuning process. Second, the tuning time should be reasonably short, since time-consuming tuning is not practical for production and online troubleshooting. In this paper, we design ResTune to automatically optimize resource utilization without violating SLA constraints on throughput and latency requirements. ResTune leverages the tuning experience from historical tasks and transfers the accumulated knowledge to accelerate the tuning of new tasks. The prior knowledge from historical tuning tasks is represented through an ensemble model. The model learns the similarity between the historical workloads and the target workload, which significantly reduces the tuning time through a meta-learning-based approach. ResTune can efficiently handle different workloads and various hardware environments. We perform evaluations using benchmarks and real-world workloads on different types of resources. The results show that, compared with manually tuned configurations, ResTune reduces CPU utilization, I/O, and memory consumption by 65%, 87%, and 39% on average, respectively. Compared with state-of-the-art methods, ResTune finds better configurations with up to ~18x speedups.
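The constrained objective stated in the abstract (minimize resource utilization subject to SLA constraints on throughput and latency) can be illustrated with a stand-in tuning loop. The random search, the knob names, and the synthetic workload model below are assumptions for illustration; ResTune itself uses Bayesian optimization boosted by meta-learned knowledge from historical tasks:

```python
# Illustration of the constrained tuning objective only: minimize CPU
# utilization while rejecting any configuration that violates the SLA.
import random

SLA = {"min_tps": 900.0, "max_latency_ms": 20.0}

def run_workload(knobs):
    """Hypothetical benchmark: returns (cpu_util, tps, latency_ms) for knobs."""
    bp, conn = knobs["buffer_pool_gb"], knobs["max_connections"]
    cpu = 0.20 + 0.0006 * conn + 0.01 * bp
    tps = 600 + 40 * bp + 1.5 * conn
    lat = 30 - 1.2 * bp + 0.02 * conn
    return cpu, tps, lat

def tune(iterations=200, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        knobs = {"buffer_pool_gb": rng.uniform(1, 16),
                 "max_connections": rng.randrange(50, 500)}
        cpu, tps, lat = run_workload(knobs)
        if tps < SLA["min_tps"] or lat > SLA["max_latency_ms"]:
            continue                       # violates the SLA: reject
        if best is None or cpu < best[0]:  # otherwise minimize CPU utilization
            best = (cpu, knobs)
    return best

if __name__ == "__main__":
    print(tune())
```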
Citations: 46
Automatic Optimization of Matrix Implementations for Distributed Machine Learning and Linear Algebra
Pub Date: 2021-06-09, DOI: 10.1145/3448016.3457317
Shangyu Luo, Dimitrije Jankov, Binhang Yuan, C. Jermaine
Abstract: Machine learning (ML) computations are often expressed using vectors, matrices, or higher-dimensional tensors. Such data structures can have many different implementations, especially in a distributed environment: a matrix could be stored as row or column vectors, as tiles of different sizes, or relationally, as a set of (rowIndex, colIndex, value) triples. Many other storage formats are possible. The choice of format can have a profound impact on the performance of an ML computation. In this paper, we propose a framework for automatic optimization of the physical implementation of a complex ML or linear algebra (LA) computation in a distributed environment, develop algorithms for solving this problem, and show, through a prototype on top of a distributed relational database system, that our ideas can radically speed up common ML and LA computations.
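The point that one logical matrix admits very different physical implementations can be made concrete: below, the same matrix is held both as a dense array and as relational (rowIndex, colIndex, value) triples, with a join-style multiply over the triples. This is purely illustrative and is not the paper's optimizer or distributed runtime:

```python
# One logical matrix, two physical implementations: dense row-major lists and
# relational (rowIndex, colIndex, value) triples, plus a join-style multiply.
from collections import defaultdict

def to_triples(dense):
    return [(i, j, v) for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0]

def triple_matmul(a_triples, b_triples):
    """C[i,k] = sum_j A[i,j] * B[j,k], computed as a join on the shared index j."""
    b_by_row = defaultdict(list)
    for j, k, v in b_triples:
        b_by_row[j].append((k, v))
    out = defaultdict(float)
    for i, j, av in a_triples:
        for k, bv in b_by_row.get(j, []):
            out[(i, k)] += av * bv
    return dict(out)

if __name__ == "__main__":
    A = [[1, 0], [2, 3]]
    B = [[0, 4], [5, 6]]
    print(triple_matmul(to_triples(A), to_triples(B)))
    # {(0, 1): 4.0, (1, 1): 26.0, (1, 0): 15.0}  (zero entries are omitted)
```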
Citations: 8