Block Sampling: Efficient Accurate Online Aggregation in MapReduce

2013 IEEE 5th International Conference on Cloud Computing Technology and Science. Pub Date: 2013-12-02. DOI: 10.1109/CloudCom.2013.40

Vasiliki Kalavri, V. Brundza, Vladimir Vlassov


Citations: 18

Abstract

Large-scale data processing frameworks, such as Hadoop MapReduce, are widely used to analyze enormous amounts of data. However, processing is often time-consuming, preventing interactive analysis. One way to decrease response time is partial job execution, where an approximate, early result becomes available to the user prior to job completion. The Hadoop Online Prototype (HOP) uses online aggregation to provide early results by partially executing jobs on subsets of the input, using a simplistic progress metric. Because HOP reads the input sequentially, values are not objectively represented in the input subset, often resulting in poor approximations or "data bias". In this paper, we propose a block sampling technique for large-scale data processing, which can be used for fast and accurate partial job execution. Our implementation of the technique on top of HOP uniformly samples HDFS blocks and uses in-memory shuffling to reduce data bias. Our prototype significantly improves the accuracy of HOP's early results, while only introducing minimal overhead. We evaluate our technique using real-world datasets and applications and demonstrate that our system outperforms HOP in terms of accuracy. In particular, when estimating the average temperature of the studied dataset, our system provides high accuracy (less than 20% absolute error) after processing only 10% of the input, while HOP needs to process 70% of the input to yield comparable results.
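The contrast the abstract draws between sequential partial execution and uniform block sampling can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' HOP implementation: the function names, the block size of 100 records, and the synthetic ordered "temperature" values are all hypothetical, and plain Python lists stand in for HDFS blocks.

```python
import random


def split_into_blocks(records, block_size):
    """Partition records into fixed-size blocks (hypothetical stand-ins for HDFS blocks)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def sequential_estimate(blocks, fraction):
    """Estimate the mean from the first `fraction` of blocks, read in stored order."""
    k = max(1, int(len(blocks) * fraction))
    values = [v for block in blocks[:k] for v in block]
    return sum(values) / len(values)


def block_sample_estimate(blocks, fraction, rng):
    """Estimate the mean from a uniform random sample of blocks."""
    k = max(1, int(len(blocks) * fraction))
    sampled = rng.sample(blocks, k)  # uniform sampling without replacement
    values = [v for block in sampled for v in block]
    # In-memory shuffle of the sampled records. It does not change this mean,
    # but it removes the stored ordering for any order-sensitive consumer.
    rng.shuffle(values)
    return sum(values) / len(values)


# Synthetic input stored in sorted order, so any prefix is unrepresentative.
records = [i / 100 for i in range(10_000)]
blocks = split_into_blocks(records, block_size=100)
true_mean = sum(records) / len(records)

rng = random.Random(42)  # fixed seed for repeatability
seq = sequential_estimate(blocks, fraction=0.10)
blk = block_sample_estimate(blocks, fraction=0.10, rng=rng)

seq_err = abs(seq - true_mean)
blk_err = abs(blk - true_mean)
print(f"true mean:          {true_mean:.3f}")
print(f"sequential 10%:     {seq:.3f} (abs error {seq_err:.3f})")
print(f"block sample 10%:   {blk:.3f} (abs error {blk_err:.3f})")
```

On input that is stored in sorted order, the sequential prefix sees only the smallest values, while the block sample draws from the whole value range. How large the gap is on real data depends on how strongly the stored order correlates with the aggregated value.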

