Managing Tiny Tasks for Data-Parallel, Subsampling Workloads

Sundeep Kambhampati, Jaimie Kelley, Christopher Stewart, William C. L. Stewart, R. Ramnath
Published in: 2014 IEEE International Conference on Cloud Engineering
Publication date: 2014-03-11
DOI: 10.1109/IC2E.2014.94 (https://doi.org/10.1109/IC2E.2014.94)
Citations: 3

Abstract

Subsampling workloads compute statistics from a set of observed samples using a random subset of the sample data (i.e., a subsample). Data-parallel platforms group these samples into tasks, and each task subsamples its data in parallel. In this paper, we study subsampling workloads that benefit from tiny tasks, i.e., tasks comprising few samples. Tiny tasks reduce the processor cache misses caused by random subsampling, which shortens per-task running time. However, they can also incur significant scheduling overheads that negate the time saved by fewer cache misses. For example, vanilla Hadoop takes longer to start tiny tasks than to run them. We compared the task scheduling overheads of vanilla Hadoop, lightweight Hadoop setups, and BashReduce. BashReduce, the best platform, outperformed the worst by 3.6X, but scheduling overhead was still 12% of a task's running time. We improved BashReduce's scheduler by allowing it to size tasks according to kneepoints on the miss rate curve. We tested these changes on high-throughput genotype data and on data obtained from Netflix. Our improved BashReduce outperformed vanilla Hadoop by almost 3X and completed short, interactive jobs almost as efficiently as long jobs. These results held at scale and across diverse, heterogeneous hardware.
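
The abstract does not say how a kneepoint is located, so the following is only a minimal sketch under stated assumptions: the miss rate curve is given as hypothetical measured (task size, miss rate) pairs, and the knee is taken to be the task size just before the steepest rise in miss rate. The function names, the heuristic, and the numbers are illustrative and are not taken from the paper or from BashReduce.

```python
def kneepoint_task_size(sizes, miss_rates):
    """Pick a task size from a measured miss rate curve.

    sizes: task sizes (samples per task), in increasing order.
    miss_rates: cache miss rate observed at each task size.

    Heuristic (an assumption, not the paper's method): return the size
    just before the largest increase in miss rate between neighbours.
    """
    if len(sizes) != len(miss_rates) or len(sizes) < 2:
        raise ValueError("need two or more matching (size, miss rate) points")

    # Increase in miss rate from each measured size to the next one.
    jumps = [miss_rates[i + 1] - miss_rates[i] for i in range(len(sizes) - 1)]
    steepest = max(range(len(jumps)), key=lambda i: jumps[i])

    if jumps[steepest] <= 0:
        # Miss rate never rises, so larger tasks cost nothing in cache
        # behaviour and need fewer task launches.
        return sizes[-1]
    return sizes[steepest]


def split_into_tasks(samples, task_size):
    """Group samples into consecutive tasks of at most task_size samples."""
    return [samples[i:i + task_size] for i in range(0, len(samples), task_size)]


if __name__ == "__main__":
    # Hypothetical curve: miss rate stays low until the working set
    # outgrows the cache, then climbs sharply.
    sizes = [1, 2, 4, 8, 16, 32, 64]
    miss_rates = [0.02, 0.02, 0.03, 0.04, 0.30, 0.55, 0.60]

    task_size = kneepoint_task_size(sizes, miss_rates)
    tasks = split_into_tasks(list(range(20)), task_size)
    print(task_size, [len(t) for t in tasks])
```

Sizing at the knee reflects the trade-off the abstract describes: tasks below the knee gain little further cache benefit while multiplying task launches, and tasks above it pay a sharply higher miss rate.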