Block Sampling: Efficient Accurate Online Aggregation in MapReduce

Vasiliki Kalavri, V. Brundza, Vladimir Vlassov
Published in: 2013 IEEE 5th International Conference on Cloud Computing Technology and Science
Publication date: 2013-12-02
DOI: 10.1109/CloudCom.2013.40 (https://doi.org/10.1109/CloudCom.2013.40)
Citations: 18

Abstract

Large-scale data processing frameworks, such as Hadoop MapReduce, are widely used to analyze enormous amounts of data. However, processing is often time-consuming, which prevents interactive analysis. One way to decrease response time is partial job execution, where an approximate, early result becomes available to the user before the job completes. The Hadoop Online Prototype (HOP) uses online aggregation to provide early results by partially executing jobs on subsets of the input, using a simplistic progress metric. Because HOP processes the input sequentially, values are not fairly represented in the input subset, which often results in poor approximations or "data bias". In this paper, we propose a block sampling technique for large-scale data processing that can be used for fast and accurate partial job execution. Our implementation of the technique on top of HOP uniformly samples HDFS blocks and uses in-memory shuffling to reduce data bias. Our prototype significantly improves the accuracy of HOP's early results while introducing only minimal overhead. We evaluate our technique using real-world datasets and applications and demonstrate that our system outperforms HOP in terms of accuracy. In particular, when estimating the average temperature of the studied dataset, our system provides high accuracy (less than 20% absolute error) after processing only 10% of the input, while HOP needs to process 70% of the input to yield comparable results.
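The contrast between sequential partial execution and uniform block sampling can be illustrated with a small simulation. The sketch below is not the authors' implementation and uses no Hadoop or HDFS APIs: plain Python lists stand in for HDFS blocks, the input is a synthetic, time-ordered series rather than the paper's temperature dataset, and every function name (`split_into_blocks`, `sequential_prefix`, `block_sample`, `percent_error`) is hypothetical. The error metric, the deviation from the true mean as a percentage of the true mean, is also an assumption about how the accuracy figure might be computed.

```python
import random


def split_into_blocks(records, block_size):
    """Split records into fixed-size chunks (a stand-in for HDFS blocks)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def sequential_prefix(blocks, fraction):
    """Return the records of the first `fraction` of blocks, in storage order."""
    k = max(1, round(len(blocks) * fraction))
    return [r for block in blocks[:k] for r in block]


def block_sample(blocks, fraction, rng):
    """Pick `fraction` of the blocks uniformly at random, without replacement,
    then shuffle the selected records in memory."""
    k = max(1, round(len(blocks) * fraction))
    chosen = rng.sample(blocks, k)
    records = [r for block in chosen for r in block]
    rng.shuffle(records)
    return records


def mean(values):
    return sum(values) / len(values)


def percent_error(estimate, truth):
    """Deviation from the true value, as a percentage of the true value."""
    return abs(estimate - truth) / abs(truth) * 100.0


if __name__ == "__main__":
    # Synthetic, strictly increasing input: an ordered file is the worst case
    # for reading a prefix, because early records are not representative.
    records = [float(i) for i in range(1, 10001)]
    blocks = split_into_blocks(records, block_size=100)
    truth = mean(records)

    rng = random.Random(7)  # fixed seed so that runs are repeatable
    seq_estimate = mean(sequential_prefix(blocks, 0.10))
    sampled_estimate = mean(block_sample(blocks, 0.10, rng))

    print("true mean:", truth)
    print("sequential 10%:", seq_estimate, percent_error(seq_estimate, truth))
    print("block-sampled 10%:", sampled_estimate, percent_error(sampled_estimate, truth))
```

On this ordered input, the sequential prefix sees only the smallest values, so its estimate is far from the true mean, whereas the sampled blocks are spread across the whole file. In this sketch the in-memory shuffle does not change the mean of the selected records; it only affects the order in which they would reach an incremental aggregator.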