Managing Tiny Tasks for Data-Parallel, Subsampling Workloads
Sundeep Kambhampati, Jaimie Kelley, Christopher Stewart, William C. L. Stewart, R. Ramnath
2014 IEEE International Conference on Cloud Engineering, March 2014. DOI: 10.1109/IC2E.2014.94
Citations: 3
Abstract
Subsampling workloads compute statistics from a set of observed samples using a random subset of the sample data (i.e., a subsample). Data-parallel platforms group these samples into tasks; each task subsamples its data in parallel. In this paper, we study subsampling workloads that benefit from tiny tasks, i.e., tasks comprising only a few samples. Tiny tasks reduce the processor cache misses caused by random subsampling, which shortens per-task running time. However, they can also incur significant scheduling overheads that negate the gains from reduced cache misses. For example, vanilla Hadoop takes longer to start tiny tasks than to run them. We compared the task scheduling overheads of vanilla Hadoop, lightweight Hadoop setups, and BashReduce. BashReduce, the best platform, outperformed the worst by 3.6X, but scheduling overhead still accounted for 12% of a task's running time. We improved BashReduce's scheduler by allowing it to size tasks according to kneepoints on the miss rate curve. We tested these changes on high-throughput genotype data and on data obtained from Netflix. Our improved BashReduce outperformed vanilla Hadoop by almost 3X and completed short, interactive jobs almost as efficiently as long jobs. These results held at scale and across diverse, heterogeneous hardware.
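As a rough illustration of kneepoint-based task sizing, the sketch below picks a task size from a measured miss-rate curve. This is a minimal sketch, not the paper's scheduler: the `kneepoint` helper and the miss-rate profile are hypothetical, and the max-deviation-from-chord heuristic is one common way to locate a knee, assumed here rather than taken from the paper.

```python
def kneepoint(task_sizes, miss_rates):
    """Return the task size at the knee of a measured miss-rate curve.

    Heuristic (assumed, not from the paper): normalize both axes to
    [0, 1] and pick the point that deviates most from the straight
    chord joining the curve's endpoints.
    """
    x0, x1 = task_sizes[0], task_sizes[-1]
    y0, y1 = miss_rates[0], miss_rates[-1]
    best_size, best_dev = task_sizes[0], -1.0
    for x, y in zip(task_sizes, miss_rates):
        xn = (x - x0) / (x1 - x0)
        yn = (y - y0) / (y1 - y0)
        dev = abs(yn - xn)
        if dev > best_dev:
            best_size, best_dev = x, dev
    return best_size

# Hypothetical profile: the miss rate stays low while a task's samples
# fit in cache, then climbs sharply once they no longer do.
sizes = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000]   # samples per task
rates = [0.010, 0.011, 0.013, 0.020, 0.180, 0.600]     # cache miss rate
print("suggested samples per task:", kneepoint(sizes, rates))  # -> 8000
```

Under these assumptions, a scheduler would profile the miss-rate curve once per workload and then cap tasks at the suggested size, keeping subsample accesses cache-resident without paying startup costs for tasks smaller than necessary.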