Proceedings of the 2016 International Conference on Management of Data最新文献_第3页

Introduction to Spark 2.0 for Database Researchers 数据库研究人员Spark 2.0简介

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2912565

Michael Armbrust, Doug Bateman, Reynold Xin, M. Zaharia

{"title":"Introduction to Spark 2.0 for Database Researchers","authors":"Michael Armbrust, Doug Bateman, Reynold Xin, M. Zaharia","doi":"10.1145/2882903.2912565","DOIUrl":"https://doi.org/10.1145/2882903.2912565","url":null,"abstract":"Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark. Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. This tutorial is designed for database researchers (graduate students, faculty members, and industrial researchers) interested in a brief hands-on overview of Spark. This tutorial covers the core APIs for using Spark 2.0, including DataFrames, Datasets, SQL, streaming and machine learning pipelines. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment. In addition, we will dive into the engine internals to discuss architectural design choices and their implications in practice. We will guide the audience to \"hack\" Spark by extending its query optimizer to speed up distributed join execution.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88411008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 20

GPL: A GPU-based Pipelined Query Processing Engine GPL:基于gpu的流水线查询处理引擎

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915224

Johns Paul, Jiong He, Bingsheng He

{"title":"GPL: A GPU-based Pipelined Query Processing Engine","authors":"Johns Paul, Jiong He, Bingsheng He","doi":"10.1145/2882903.2915224","DOIUrl":"https://doi.org/10.1145/2882903.2915224","url":null,"abstract":"Graphics Processing Units (GPUs) have evolved as a powerful query co-processor for main memory On-Line Analytical Processing (OLAP) databases. However, existing GPU-based query processors adopt a kernel-based execution approach which optimizes individual kernels for resource utilization and executes the GPU kernels involved in the query plan one by one. Such a kernel-based approach cannot utilize all GPU resources efficiently due to the resource underutilization of individual kernels and memory ping-pong across kernel executions. In this paper, we propose GPL, a novel pipelined query execution engine to improve the resource utilization of query co-processing on the GPU. Different from the existing kernel-based execution, GPL takes advantage of hardware features of new-generation GPUs including concurrent kernel execution and efficient data communication channel between kernels. We further develop an analytical model to guide the generation of the optimal pipelined query plan. Thus, the tile size of the pipelined query execution can be adapted in a cost-based manner. We evaluate GPL with TPC-H queries on both AMD and NVIDIA GPUs. The experimental results show that 1) the analytical model is able to guide determining the suitable parameter values in pipelined query execution plan, and 2) GPL is able to significantly outperform the state-of-the-art kernel-based query processing approaches, with improvement up to 48%.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90944021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 65

Provenance: On and Behind the Screens 出处:屏幕上和屏幕后

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2912568

Melanie Herschel, Marcel Hlawatsch

{"title":"Provenance: On and Behind the Screens","authors":"Melanie Herschel, Marcel Hlawatsch","doi":"10.1145/2882903.2912568","DOIUrl":"https://doi.org/10.1145/2882903.2912568","url":null,"abstract":"Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance meta-data have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Often, users interested in provenance are not necessarily experts in databases or query languages, as they are typically domain experts of the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios, it is difficult to use solely queries for analyzing and exploring provenance data. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we will present some fundamental concepts of visualization before we discuss possible visualizations for provenance.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"112 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88910514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

Speedup Graph Processing by Graph Ordering 通过图排序加速图处理

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915220

Hao Wei, J. Yu, Can Lu, Xuemin Lin

{"title":"Speedup Graph Processing by Graph Ordering","authors":"Hao Wei, J. Yu, Can Lu, Xuemin Lin","doi":"10.1145/2882903.2915220","DOIUrl":"https://doi.org/10.1145/2882903.2915220","url":null,"abstract":"The CPU cache performance is one of the key issues to efficiency in database systems. It is reported that cache miss latency takes a half of the execution time in database systems. To improve the CPU cache performance, there are studies to support searching including cache-oblivious, and cache-conscious trees. In this paper, we focus on CPU speedup for graph computing in general by reducing the CPU cache miss ratio for different graph algorithms. The approaches dealing with trees are not applicable to graphs which are complex in nature. In this paper, we explore a general approach to speed up CPU computing, in order to further enhance the efficiency of the graph algorithms without changing the graph algorithms (implementations) and the data structures used. That is, we aim at designing a general solution that is not for a specific graph algorithm, neither for a specific data structure. The approach studied in this work is graph ordering, which is to find the optimal permutation among all nodes in a given graph by keeping nodes that will be frequently accessed together locally, to minimize the CPU cache miss ratio. We prove the graph ordering problem is NP-hard, and give a basic algorithm with a bounded approximation. To improve the time complexity of the basic algorithm, we further propose a new algorithm to reduce the time complexity and improve the efficiency with new optimization techniques based on a new data structure. We conducted extensive experiments to evaluate our approach in comparison with other 9 possible graph orderings (such as the one obtained by METIS) using 8 large real graphs and 9 representative graph algorithms. We confirm that our approach can achieve high performance by reducing the CPU cache miss ratios.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"11 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85118617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 132

How to Architect a Query Compiler 如何构建查询编译器

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915244

A. Shaikhha, Yannis Klonatos, L. Parreaux, Lewis Brown, Mohammad Dashti, Christoph E. Koch

引用次数: 81

PerfEnforce Demonstration: Data Analytics with Performance Guarantees perfenforcement演示:具有性能保证的数据分析

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899402

Jennifer Ortiz, Brendan Lee, M. Balazinska

引用次数: 28

Wildfire: Concurrent Blazing Data Ingest and Analytics 野火:并发燃烧数据摄取和分析

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899406

Ronald Barber, Matthew Huras, G. Lohman, C. Mohan, René Müller, Fatma Özcan, H. Pirahesh, Vijayshankar Raman, Richard Sidle, O. Sidorkin, Adam J. Storm, Yuanyuan Tian, Pınar Tözün

引用次数: 16

Searching Web Data using MinHash LSH 使用MinHash LSH搜索Web数据

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2914838

B. Rao, Erkang Zhu

引用次数: 7

Web-based Benchmarks for Forecasting Systems: The ECAST Platform 基于网络的预测系统基准:ECAST平台

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899399

R. Ulbricht, Claudio Hartmann, M. Hahmann, H. Donker, Wolfgang Lehner

引用次数: 5

A Hybrid B+-tree as Solution for In-Memory Indexing on CPU-GPU Heterogeneous Computing Platforms 基于混合B+树的CPU-GPU异构计算平台内存索引解决方案

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882918

Amirhesam Shahvarani, H. Jacobsen

引用次数: 38