Extending Map-Reduce for Efficient Predicate-Based Sampling

2012 IEEE 28th International Conference on Data Engineering. Pub Date: 2012-04-01. DOI: 10.1109/ICDE.2012.104

Raman Grover, M. Carey


Citations: 56

Abstract

In this paper we address the problem of using MapReduce to sample a massive data set in order to produce a fixed-size sample whose contents satisfy a given predicate. While it is simple to express this computation using MapReduce, its default Hadoop execution is dependent on the input size and is wasteful of cluster resources. This is unfortunate, as sampling queries are fairly common (e.g., for exploratory data analysis at Facebook), and the resulting waste can significantly impact the performance of a shared cluster. To address such use cases, we present the design, implementation and evaluation of a Hadoop execution model extension that supports incremental job expansion. Under this model, a job consumes input as required and can dynamically govern its resource consumption while producing the required results. The proposed mechanism is able to support a variety of policies regarding job growth rates as they relate to cluster capacity and current load. We have implemented the mechanism in Hadoop, and we present results from an experimental performance study of different job growth policies under both single- and multi-user workloads.
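To make the idea of incremental job expansion concrete, here is a minimal single-process sketch in Python. It is not the authors' Hadoop extension: the function `incremental_sample`, the `growth` callback, and the toy data are all hypothetical names introduced for illustration. The sketch assumes the input is already divided into splits, and that a growth policy decides how many additional splits to consume at each step. It stops as soon as `k` records satisfying the predicate have been collected, so the work done depends on how quickly matches are found rather than on the total input size. It makes no claim about the statistical properties of the resulting sample.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def incremental_sample(
    splits: Sequence[Iterable[T]],
    predicate: Callable[[T], bool],
    k: int,
    growth: Callable[[int], int],
) -> Tuple[List[T], int]:
    """Collect up to k records satisfying `predicate`.

    Returns the sample and the number of splits consumed.
    `growth(step)` is a hypothetical policy hook: it says how many
    new splits to add at expansion step `step` (0, 1, 2, ...).
    """
    sample: List[T] = []
    consumed = 0
    step = 0
    while len(sample) < k and consumed < len(splits):
        # Expand the job by at least one split per step.
        batch = max(1, growth(step))
        # Every split in the batch is scanned, loosely mimicking map
        # tasks that were already launched; the sample is capped at k.
        for split in splits[consumed:consumed + batch]:
            for record in split:
                if predicate(record) and len(sample) < k:
                    sample.append(record)
        consumed = min(consumed + batch, len(splits))
        step += 1
    return sample, consumed


# Toy input: 50 splits of 100 integers each (assumed data, not from the paper).
toy_splits = [range(i * 100, (i + 1) * 100) for i in range(50)]

# Doubling growth policy: 1 split, then 2, then 4, ...
result, used = incremental_sample(
    toy_splits,
    predicate=lambda x: x % 10 == 0,
    k=25,
    growth=lambda step: 2 ** step,
)
print(len(result), "records from", used, "of", len(toy_splits), "splits")
```

In this toy run the job is satisfied after reading only a small prefix of the input, whereas a scan of all 50 splits would do the same filtering work on every split regardless of `k`. Swapping the `growth` callback (for example, a constant instead of doubling) is a stand-in for the different job growth policies the abstract mentions; how the real mechanism ties growth to cluster capacity and load is not shown here.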



