Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data: Latest Papers, Page 2

Spark SQL: Relational Data Processing in Spark

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742797

Michael Armbrust, Reynold Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, M. Franklin, A. Ghodsi, M. Zaharia

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
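The abstract above describes Catalyst as an optimizer built from composable rules. The following is a minimal sketch of that general idea in plain Python, not Catalyst's actual Scala API: the expression classes (`Lit`, `Col`, `Add`), the two rules, and the `optimize` driver are all hypothetical names chosen for illustration. Each rule is a small function on one tree node, and the driver applies the rules bottom-up until nothing changes.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical toy expression tree; these are not Catalyst's classes.
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Rewrite the children first, then apply the rule to the node itself."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Replace an addition of two literals with a single literal."""
    if isinstance(expr, Add) and isinstance(expr.left, Lit) and isinstance(expr.right, Lit):
        return Lit(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr: Expr) -> Expr:
    """Simplify x + 0 and 0 + x to x."""
    if isinstance(expr, Add):
        if expr.left == Lit(0):
            return expr.right
        if expr.right == Lit(0):
            return expr.left
    return expr


def optimize(expr: Expr, rules: List[Rule]) -> Expr:
    """Apply every rule over the whole tree until a fixed point is reached."""
    while True:
        rewritten = expr
        for rule in rules:
            rewritten = transform_up(rewritten, rule)
        if rewritten == expr:
            return expr
        expr = rewritten


plan = Add(Col("x"), Add(Lit(1), Lit(-1)))
optimized = optimize(plan, [fold_constants, drop_zero])
```

In this sketch the rules stay independent of each other: `fold_constants` first turns `1 + (-1)` into `0`, which then lets `drop_zero` reduce `x + 0` to `x`. New rules can be appended to the list without touching the existing ones.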

Citations: 1296

Analytics in Motion: High Performance Event-Processing AND Real-Time Analytics in the Same Database

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742783

Lucas Braun, Thomas Etter, Georgios Gasparis, Martin Kaufmann, Donald Kossmann, Daniel Widmer, Aharon Avitzur, A. Iliopoulos, Eliezer Levy, Ning Liang

Modern data-centric flows in the telecommunications industry require real time analytical processing over a rapidly changing and large dataset. The traditional approach of separating OLTP and OLAP workloads cannot satisfy this requirement. Instead, a new class of integrated solutions for handling hybrid workloads is needed. This paper presents an industrial use case and a novel architecture that integrates key-value-based event processing and SQL-based analytical processing on the same distributed store while minimizing the total cost of ownership. Our approach combines several well-known techniques such as shared scans, delta processing, a PAX-fashioned storage layout, and an interleaving of scanning and delta merging in a completely new way. Performance experiments show that our system scales out linearly with the number of servers. For instance, our system sustains event streams of 100,000 events per second while simultaneously processing 100 ad-hoc analytical queries per second, using a cluster of 12 commodity servers. In doing so, our system meets all response time goals of our telecommunication customers; that is, 10 milliseconds per event and 100 milliseconds for an ad-hoc analytical query. Moreover, our system beats commercial competitors by a factor of 2.5 in analytical and two orders of magnitude in update performance.
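The abstract above names shared scans, delta processing, and an interleaving of scanning and delta merging. A single-node toy sketch of how these pieces could fit together is given below. It is an illustration under stated assumptions, not the paper's architecture: the class `DeltaMainStore` and its methods are hypothetical, records are plain dicts, and the PAX layout and distribution across servers are left out.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, int]
Predicate = Callable[[Record], bool]


class DeltaMainStore:
    """Toy store: events write to a delta, analytical scans read the main."""

    def __init__(self) -> None:
        self.main: Dict[str, Record] = {}   # scan-optimized, read by queries
        self.delta: Dict[str, Record] = {}  # absorbs key-value updates

    def put(self, key: str, record: Record) -> None:
        """Event processing path: writes touch only the delta."""
        self.delta[key] = record

    def get(self, key: str) -> Optional[Record]:
        """Point lookups see the freshest version (delta before main)."""
        if key in self.delta:
            return self.delta[key]
        return self.main.get(key)

    def merge_delta(self) -> None:
        """Fold pending updates into the main and start a fresh delta."""
        self.main.update(self.delta)
        self.delta = {}

    def shared_scan(self, predicates: List[Predicate]) -> List[int]:
        """One pass over the main answers a whole batch of count queries."""
        counts = [0] * len(predicates)
        for record in self.main.values():
            for i, predicate in enumerate(predicates):
                if predicate(record):
                    counts[i] += 1
        return counts

    def scan_cycle(self, predicates: List[Predicate]) -> List[int]:
        """Assumed interleaving: merge the delta, then run one shared scan."""
        self.merge_delta()
        return self.shared_scan(predicates)


store = DeltaMainStore()
store.put("alice", {"calls": 5})
store.put("bob", {"calls": 20})
queries = [lambda r: r["calls"] > 10, lambda r: True]
first = store.scan_cycle(queries)
store.put("alice", {"calls": 30})
stale = store.shared_scan(queries)   # the update is still only in the delta
fresh = store.scan_cycle(queries)    # visible after the next merge
```

The design choice this sketch shows is that writers never block on scans: an update becomes visible to point lookups immediately and to analytical queries after the next merge step.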

Citations: 48

iCrowd: An Adaptive Crowdsourcing Framework

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2750550

Ju Fan, Guoliang Li, B. Ooi, K. Tan, Jianhua Feng

Crowdsourcing is widely accepted as a means for resolving tasks that machines are not good at. Unfortunately, Crowdsourcing may yield relatively low-quality results if there is no proper quality control. Although previous studies attempt to eliminate "bad" workers by using qualification tests, the accuracies estimated from qualifications may not be accurate, because workers have diverse accuracies across tasks. Thus, the quality of the results could be further improved by selectively assigning tasks to the workers who are well acquainted with the tasks. To this end, we propose an adaptive crowdsourcing framework, called iCrowd. iCrowd on-the-fly estimates accuracies of a worker by evaluating her performance on the completed tasks, and predicts which tasks the worker is well acquainted with. When a worker requests for a task, iCrowd assigns her a task, to which the worker has the highest estimated accuracy among all online workers. Once a worker submits an answer to a task, iCrowd analyzes her answer and adjusts estimation of her accuracies to improve subsequent task assignments. This paper studies the challenges that arise in iCrowd. The first is how to estimate diverse accuracies of a worker based on her completed tasks. The second is instant task assignment. We deploy iCrowd on Amazon Mechanical Turk, and conduct extensive experiments on real datasets. Experimental results show that iCrowd achieves higher quality than existing approaches.
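The estimate-then-assign loop described in the abstract above can be sketched as follows. This is a deliberately simplified stand-in, not iCrowd's estimation method: accuracy is tracked per (worker, topic) pair with add-one smoothing, tasks carry a hypothetical topic label, and the correctness of an answer is assumed to be known (for example from a gold answer). The class `AdaptiveAssigner` and all of its names are illustrative.

```python
from typing import Dict, Iterable, Optional, Tuple


class AdaptiveAssigner:
    """Toy adaptive task assignment driven by per-topic accuracy estimates."""

    def __init__(self) -> None:
        self.correct: Dict[Tuple[str, str], int] = {}
        self.total: Dict[Tuple[str, str], int] = {}

    def record(self, worker: str, topic: str, was_correct: bool) -> None:
        """Update the estimate after a worker submits an answer."""
        key = (worker, topic)
        self.total[key] = self.total.get(key, 0) + 1
        if was_correct:
            self.correct[key] = self.correct.get(key, 0) + 1

    def accuracy(self, worker: str, topic: str) -> float:
        """Smoothed accuracy; a worker with no history starts at 0.5."""
        key = (worker, topic)
        return (self.correct.get(key, 0) + 1) / (self.total.get(key, 0) + 2)

    def assign(
        self,
        worker: str,
        open_tasks: Dict[str, str],      # task id -> topic (assumed labels)
        online_workers: Iterable[str],
    ) -> Optional[str]:
        """Prefer tasks where the requester is the best online worker."""
        online = list(online_workers)
        preferred = [
            task
            for task, topic in open_tasks.items()
            if all(self.accuracy(worker, topic) >= self.accuracy(o, topic) for o in online)
        ]
        pool = preferred or list(open_tasks)
        if not pool:
            return None
        return max(pool, key=lambda task: self.accuracy(worker, open_tasks[task]))


assigner = AdaptiveAssigner()
assigner.record("w1", "sports", True)
assigner.record("w1", "sports", True)
assigner.record("w1", "movies", False)
assigner.record("w2", "movies", True)
tasks = {"t1": "sports", "t2": "movies"}
task_for_w1 = assigner.assign("w1", tasks, ["w1", "w2"])
task_for_w2 = assigner.assign("w2", tasks, ["w1", "w2"])
```

Here `w1` is routed to the sports task and `w2` to the movies task, because each holds the higher estimate on that topic among the online workers. If the requester is not the best worker for any open task, the sketch falls back to her own highest-accuracy task.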

Citations: 189

Crowd-Based Deduplication: An Adaptive Approach

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2723739

Sibo Wang, Xiaokui Xiao, Chun-Hee Lee

Citations: 71

Three Favorite Results

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2753770

J. Widom

Citations: 0

Telco Churn Prediction with Big Data

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742794

Yiqing Huang, Fangzhou Zhu, Mingxuan Yuan, K. Deng, Yanhua Li, Bing Ni, Wenyuan Dai, Qiang Yang, Jia Zeng

Citations: 110

Private Release of Graph Statistics using Ladder Functions

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2737785

Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, D. Srivastava, Xiaokui Xiao

Protecting the privacy of individuals in graph structured data while making accurate versions of the data available is one of the most challenging problems in data privacy. Most efforts to date to perform this data release end up mired in complexity, overwhelm the signal with noise, and are not effective for use in practice. In this paper, we introduce a new method which guarantees differential privacy. It specifies a probability distribution over possible outputs that is carefully defined to maximize the utility for the given input, while still providing the required privacy level. The distribution is designed to form a 'ladder', so that each output achieves the highest 'rung' (maximum probability) compared to less preferable outputs. We show how our ladder framework can be applied to problems of counting the number of occurrences of subgraphs, a vital objective in graph analysis, and give algorithms whose cost is comparable to that of computing the count exactly. Our experimental study confirms that our method outperforms existing methods for counting triangles and stars in terms of accuracy, and provides solutions for some problems for which no effective method was previously known. The results of our algorithms can be used to estimate the parameters of suitable graph models, allowing synthetic graphs to be sampled.
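To make the 'ladder' shape of the output distribution concrete, here is a toy sampler over integer outputs. It rests on strong assumptions that are not taken from the paper: every rung has the same caller-supplied width `rung_width` (the paper derives rung sizes from the input graph), and the probability drops by a factor of `exp(-epsilon / 2)` per rung, as in the standard exponential mechanism with a unit-sensitivity score. The function name `ladder_sample` is hypothetical, and the sketch on its own makes no privacy claim for any particular graph statistic.

```python
import math
import random


def ladder_sample(true_count: int, rung_width: int, epsilon: float, rng: random.Random) -> int:
    """Sample an integer whose probability decays rung by rung from true_count.

    Rung 0 holds only true_count. Rung t >= 1 holds the 2 * rung_width
    integers whose distance d from true_count satisfies
    (t - 1) * rung_width < d <= t * rung_width.
    """
    if epsilon <= 0 or rung_width < 1:
        raise ValueError("epsilon must be positive and rung_width at least 1")
    decay = math.exp(-epsilon / 2.0)  # weight ratio between adjacent rungs
    # Total weight of all rungs t >= 1: 2 * width * (decay + decay^2 + ...).
    side_mass = 2 * rung_width * decay / (1.0 - decay)
    if rng.random() < 1.0 / (1.0 + side_mass):
        return true_count  # the top rung
    # Geometric choice of the rung: P(t) is proportional to decay ** t.
    u = 1.0 - rng.random()  # uniform in (0, 1]
    rung = 1 + int(math.log(u) / math.log(decay))
    # Uniform choice of an output inside the chosen rung, on either side.
    offset = (rung - 1) * rung_width + rng.randint(1, rung_width)
    return true_count + offset if rng.random() < 0.5 else true_count - offset


rng = random.Random(0)
near_exact = ladder_sample(10, 1, 1000.0, rng)
samples = [ladder_sample(100, 3, 0.5, rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
```

With a very large `epsilon` almost all probability sits on the top rung, so the sampler returns the true count; with a small `epsilon` the mass spreads symmetrically over the lower rungs on both sides.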

Citations: 92

DunceCap: Query Plans Using Generalized Hypertree Decompositions

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2764946

Susan Tu, C. Ré

Citations: 26

Locality-aware Partitioning in Parallel Database Systems

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2723718

Erfan Zamanian, Carsten Binnig, Abdallah Salama

Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and a database schema. A common technique to reduce the network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in that respect since only subsets of tables in complex schemata sharing the same join key can be co-partitioned unless tables are fully replicated. In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows to co-partition sets of tables based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data-locality. One algorithm only needs the schema and data whereas the other algorithm additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.
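The co-partitioning idea in the abstract above can be illustrated with a small in-memory sketch. It assumes one plausible reading of predicate-based reference partitioning: a referencing row is copied into every partition that holds a join partner in the referenced table, so the join can run locally. The function names, the dict-based rows, the integer keys, and the round-robin placement of rows without a partner are assumptions made for this sketch, not details confirmed by the abstract.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Partitions = List[List[Row]]


def hash_partition(rows: List[Row], key: str, n: int) -> Partitions:
    """Conventional hash partitioning on one column (integer keys assumed)."""
    parts: Partitions = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def pref_partition(
    rows: List[Row],
    referenced_parts: Partitions,
    predicate: Callable[[Row, Row], bool],
) -> Partitions:
    """Place each row wherever the referenced table holds a join partner."""
    n = len(referenced_parts)
    parts: Partitions = [[] for _ in range(n)]
    orphans = 0
    for row in rows:
        placed = False
        for i, referenced in enumerate(referenced_parts):
            if any(predicate(row, ref) for ref in referenced):
                parts[i].append(row)  # may duplicate the row across partitions
                placed = True
        if not placed:
            # Assumption: rows without a partner are spread round-robin.
            parts[orphans % n].append(row)
            orphans += 1
    return parts


def local_join_size(left: Partitions, right: Partitions, predicate: Callable[[Row, Row], bool]) -> int:
    """Count join results computed partition by partition, with no shuffling."""
    return sum(
        1
        for lpart, rpart in zip(left, right)
        for l in lpart
        for r in rpart
        if predicate(l, r)
    )


customers = [{"ckey": 0}, {"ckey": 1}, {"ckey": 2}]
orders = [
    {"okey": 10, "ckey": 0},
    {"okey": 11, "ckey": 1},
    {"okey": 12, "ckey": 2},
    {"okey": 13, "ckey": 99},
]
same_customer = lambda order, customer: order["ckey"] == customer["ckey"]

customer_parts = hash_partition(customers, "ckey", 2)
order_parts = pref_partition(orders, customer_parts, same_customer)
local_matches = local_join_size(order_parts, customer_parts, same_customer)
global_matches = sum(1 for o in orders for c in customers if same_customer(o, c))
```

Because the placement is driven by the join predicate rather than by a shared partitioning key, the same function works for predicates that are not simple key equality, at the cost of possibly storing a row in more than one partition.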

Citations: 66

The LDBC Social Network Benchmark: Interactive Workload

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742786

O. Erling, A. Averbuch, J. Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat-Pérez, M. Pham, P. Boncz

Citations: 269