Extending Map-Reduce for Efficient Predicate-Based Sampling

Raman Grover, M. Carey
Published in: 2012 IEEE 28th International Conference on Data Engineering (ICDE)
Publication date: 2012-04-01
DOI: 10.1109/ICDE.2012.104 (https://doi.org/10.1109/ICDE.2012.104)
Citations: 56

Abstract

In this paper we address the problem of using MapReduce to sample a massive data set in order to produce a fixed-size sample whose contents satisfy a given predicate. While it is simple to express this computation using MapReduce, its default Hadoop execution is dependent on the input size and is wasteful of cluster resources. This is unfortunate, as sampling queries are fairly common (e.g., for exploratory data analysis at Facebook), and the resulting waste can significantly impact the performance of a shared cluster. To address such use cases, we present the design, implementation and evaluation of a Hadoop execution model extension that supports incremental job expansion. Under this model, a job consumes input as required and can dynamically govern its resource consumption while producing the required results. The proposed mechanism is able to support a variety of policies regarding job growth rates as they relate to cluster capacity and current load. We have implemented the mechanism in Hadoop, and we present results from an experimental performance study of different job growth policies under both single- and multi-user workloads.
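
The idea of incremental job expansion can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' Hadoop implementation: the function name `sample_incrementally`, the parameters `initial_batch` and `growth`, and the geometric growth policy are all hypothetical, and the abstract does not say which growth policies were evaluated. Each "split" stands in for one unit of input, and a batch of splits stands in for the map tasks admitted in one expansion step. The sketch returns the first `k` matching records it finds, so it makes no claim about statistical uniformity of the sample.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_incrementally(
    splits: Sequence[Sequence[T]],
    predicate: Callable[[T], bool],
    k: int,
    initial_batch: int = 1,   # hypothetical: splits admitted in the first step
    growth: int = 2,          # hypothetical: geometric growth factor per step
) -> Tuple[List[T], int]:
    """Collect up to k records satisfying predicate, reading input in batches.

    Returns the sample and the number of splits that were actually scanned.
    """
    sample: List[T] = []
    splits_scanned = 0
    batch = initial_batch
    # Keep expanding the job only while the sample is short and input remains.
    while splits_scanned < len(splits) and len(sample) < k:
        next_batch = splits[splits_scanned:splits_scanned + batch]
        for split in next_batch:
            # Map-side filter: keep only records that satisfy the predicate.
            sample.extend(record for record in split if predicate(record))
        splits_scanned += len(next_batch)
        batch *= growth  # assumed policy; a real policy could track cluster load
    return sample[:k], splits_scanned


# Usage: 100 splits of 10 integers each; sample 12 even numbers.
splits = [list(range(i * 10, i * 10 + 10)) for i in range(100)]
sample, scanned = sample_incrementally(splits, lambda r: r % 2 == 0, k=12)
print(len(sample), "records sampled after scanning", scanned, "of", len(splits), "splits")
```

In contrast, an execution that always scans every split does work proportional to the input size regardless of `k`, which is the waste the abstract describes.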